1 Overview

This is the online appendix of the paper

  • Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, Paul-Christian Bürkner (2020): Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.

The code for the paper is available on GitHub (https://github.com/avehtari/rhat_ess). Here, we present all the code for the examples discussed in the paper, along with additional numerical experiments not included in the paper itself.

To help you find your way through all the examples presented in this online appendix, the sections below walk through the examples discussed in the paper.

2 Examples

In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. Appendices A-D contain more detailed analysis of different algorithm variants and further examples.

First, we load all the necessary R packages and additional functions.

library(tidyverse)
library(gridExtra)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(rjags)
library(abind)
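# additional monitor and plotting functions accompanying the paper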
source('monitornew.R')
source('monitorplot.R')

2.1 Cauchy: A distribution with infinite mean and variance

This section relates to the examples presented in Section 5.1 of the paper.

Traditional \(\widehat{R}\) is based on calculating within- and between-chain variances. If the marginal distribution of a chain is such that the variance is infinite, this approach is not well justified, as we demonstrate here with a Cauchy-distributed example. The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution. Appendix B contains a more detailed analysis of different algorithm variants and further Cauchy examples.
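
For reference, the classic split-\(\widehat{R}\) can be sketched in a few lines of R from the standard within/between-variance formulas (a minimal sketch, not the implementation in monitornew.R):

split_rhat_sketch <- function(sims) {
  # sims: iterations x chains; first split each chain into two halves
  half <- floor(nrow(sims) / 2)
  s <- cbind(sims[1:half, ], sims[(half + 1):(2 * half), ])
  n <- nrow(s)
  B <- n * var(colMeans(s))    # between-chain variance of the split chains
  W <- mean(apply(s, 2, var))  # average within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}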

2.1.1 Nominal parameterization of Cauchy

The nominal Cauchy model with direct parameterization is as follows:

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

2.1.1.1 Default Stan options

Run the nominal model:

fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

We get HMC-specific diagnostic warnings (Betancourt, 2017) about a very large number of post-warmup transitions exceeding the maximum treedepth, as well as a low estimated Bayesian fraction of missing information, both indicating slow mixing, likely caused by the very long tails of the Cauchy distribution.
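
These HMC-specific diagnostics can also be queried directly from the fitted object (diagnostic utilities available in recent versions of rstan):

check_hmc_diagnostics(fit_nom)  # divergences, treedepth, energy (E-BFMI)
get_bfmi(fit_nom)               # per-chain Bayesian fraction of missing information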

mon <- monitor(fit_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

          Q5    Q50    Q95   Mean    SD  Rhat Bulk_ESS Tail_ESS
x[1]   -5.87   0.02   6.78   0.92 13.64  1.01     1968      360
x[2]   -5.76  -0.03   5.11  -0.27  7.53  1.00     2419      801
x[3]   -6.12   0.00   7.73   0.77 10.28  1.01      449       82
x[4]   -5.30  -0.01   5.41  -0.06  7.19  1.01     2267      696
x[5]  -13.14  -0.04   5.83  -2.78 17.58  1.03      169       45
x[6]   -8.10  -0.06   6.39  -0.92 15.59  1.01     1612      401
x[7]   -7.25  -0.08   7.55  -0.29  9.68  1.01     1229      462
x[8]   -4.79   0.02   6.23   0.57  8.65  1.02     1568      431
x[9]   -4.69   0.01   4.82   0.13  5.43  1.00     2608      783
x[10]  -8.43  -0.04   6.50  -1.92 32.96  1.01     1607      261
x[11]  -6.21   0.02   7.09   0.43 16.21  1.01     2463      473
x[12]  -5.55   0.02   6.62   0.17  6.37  1.00     2814      403
x[13]  -6.16   0.00   6.13   0.44 12.78  1.01     2282      564
x[14]  -5.45   0.02   6.79   0.29  8.44  1.01     3123      646
x[15]  -8.17   0.05   9.55   0.26 30.61  1.01     2129      521
x[16]  -5.83   0.00   6.23   0.31  8.60  1.01     2178      588
x[17] -18.10  -0.07   7.39  -6.89 47.38  1.03      209       66
x[18]  -5.89   0.05   5.90   0.10  5.42  1.01     3337      663
x[19]  -7.04   0.03   6.23  -0.23 12.26  1.01     1433      420
x[20]  -6.58   0.03   7.53  -0.19  7.81  1.02      952      445
x[21]  -8.11   0.01   6.57  -2.16 28.47  1.00     1794      396
x[22]  -5.47   0.04   5.24   0.25  8.30  1.00     2844      711
x[23] -10.04  -0.05   6.55 -11.16 87.97  1.04      318       47
x[24] -13.18  -0.07   5.72  -4.58 30.89  1.01      441       98
x[25]  -6.89  -0.02   5.60  -0.44  8.76  1.02     1894      490
x[26]  -5.33  -0.02   4.69  -0.30  6.55  1.01     3099      731
x[27]  -5.19  -0.04   5.17  -0.26  8.50  1.00     3097      873
x[28]  -6.81  -0.02   6.81   0.15  7.71  1.00     4185      923
x[29]  -9.96  -0.02   6.99  -1.66 18.87  1.01      935      222
x[30]  -6.20   0.02   6.62   0.44 14.44  1.01     3502      558
x[31] -18.84  -0.04   5.77  -4.22 29.48  1.02      210       49
x[32]  -6.33  -0.01   6.21  -0.02  6.73  1.02     3042      747
x[33]  -4.81   0.03   4.93   0.22  7.89  1.01     3506     1007
x[34]  -5.75   0.02   5.75  -0.12  9.23  1.00     1899      660
x[35]  -5.98   0.03   7.16   0.06  8.68  1.01     2706      638
x[36]  -7.37   0.00   9.19   0.08 10.82  1.01     1404      399
x[37]  -5.03   0.02   7.82   2.52 29.21  1.01     2028      356
x[38]  -5.40  -0.01   5.59  -0.06  6.64  1.01     3612      740
x[39]  -4.78  -0.02   5.36   0.27  6.40  1.00     2858      756
x[40]  -6.16   0.00   4.58  -0.91 10.45  1.01     1861      457
x[41]  -5.46   0.07   6.55   0.05  6.06  1.01     2328      619
x[42]  -7.63  -0.05   6.38  -0.75 10.10  1.01     1797      512
x[43]  -7.47  -0.04   6.39  -0.59  8.32  1.01     1877      557
x[44]  -5.22   0.00   5.17   0.11  6.15  1.01     1613      574
x[45]  -4.44   0.04  13.98   3.19 17.67  1.01      290       64
x[46]  -5.04   0.01   5.60   0.22  9.06  1.02     3154      398
x[47]  -5.87   0.03   5.46   0.19 11.73  1.02     2263      497
x[48]  -4.99  -0.03   4.91   0.05 10.22  1.01     2920      723
x[49]  -6.42   0.01  10.78  10.69 83.66  1.02      214       43
x[50]  -6.24   0.02   5.71   0.41 19.28  1.00     2617      786
I       0.00   1.00   1.00   0.51  0.50  1.00      687     4000
lp__  -88.54 -68.47 -50.77 -69.10 11.75  1.01      252      453

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
which_min_ess <- which.min(mon[1:50, 'Tail_ESS'])

Several values of Rhat greater than 1.01 and some ESS values less than 400 also indicate convergence problems. Appendix B contains more results with longer chains.

We can further analyze potential problems using local efficiency estimates and rank plots. We specifically investigate x[49], which, in this specific run, had the smallest tail-ESS of 43.

We examine the sampling efficiency in different parts of the posterior by computing the efficiency of small interval probability estimates (see Section 4.3 in the paper). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures the mixing of an indicator function which indicates whether the values are inside or outside the specific small interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.

plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)

The efficiency of sampling is low in the tails, which is clearly caused by slow mixing in the long tails of the Cauchy distribution. Orange ticks show iterations that exceeded the maximum treedepth.
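
plot_local_ess() is provided by monitorplot.R. The quantity it plots can be sketched roughly as follows (a sketch assuming rstan >= 2.19 for ess_bulk(); the implementation in monitornew.R differs in details):

local_ess_sketch <- function(sims, k = 20) {
  # sims: iterations x chains matrix of draws for one parameter
  breaks <- quantile(sims, probs = seq(0, 1, length.out = k + 1))
  sapply(seq_len(k), function(i) {
    # indicator of the draws falling into the i-th small interval
    I <- (sims > breaks[i] & sims <= breaks[i + 1]) + 0
    rstan::ess_bulk(I)  # ESS of the indicator draws
  })
}
local_ess_sketch(as.array(fit_nom)[, , "x[49]"])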

An alternative way to examine the efficiency in different parts of the posterior is to compute efficiency estimates for quantiles (see Section 4.3 in the paper). Each interval has a specified proportion of the draws, and the efficiency measures the mixing of an indicator function which indicates whether the values are smaller than or equal to the corresponding quantile.

plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)

Similarly to the above, we see that the efficiency of sampling is low in the tails. Orange ticks show iterations that exceeded the maximum treedepth.
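
Analogously, the quantile efficiency can be sketched as the ESS of the indicator of draws not exceeding a given quantile (again only a rough stand-in for the implementation in monitornew.R):

quantile_ess_sketch <- function(sims, alpha = 0.05) {
  # indicator of draws below or at the empirical alpha-quantile
  I <- (sims <= quantile(sims, probs = alpha)) + 0
  rstan::ess_bulk(I)
}
quantile_ess_sketch(as.array(fit_nom)[, , "x[49]"], alpha = 0.95)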

We may also investigate how the estimated effective sample sizes change as we use more and more draws; Brooks and Gelman (1998) proposed using a similar graph for \(\widehat{R}\). If the effective sample size is highly unstable, does not increase proportionally with more draws, or even decreases, this indicates that simply running longer chains will likely not solve the convergence issues. In the plot below, we see how unstable both bulk-ESS and tail-ESS are for this example.

plot_change_ess(fit = fit_nom, par = which_min_ess)
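
plot_change_ess() is also provided by monitorplot.R; the idea behind such curves can be sketched by recomputing the ESS on increasing prefixes of the chains (a sketch assuming rstan >= 2.19):

ess_path_sketch <- function(sims, steps = 10) {
  # number of iterations used at each step
  ns <- floor(seq(nrow(sims) / steps, nrow(sims), length.out = steps))
  sapply(ns, function(n) rstan::ess_bulk(sims[1:n, , drop = FALSE]))
}
ess_path_sketch(as.array(fit_nom)[, , "x[49]"])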

We can further analyze potential problems using rank plots, which clearly show the mixing problem between chains. In the case of good mixing, all rank plots should be close to uniform.

samp <- as.array(fit_nom)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
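
mcmc_hist_r_scale() is defined in monitorplot.R; conceptually, the ranks underlying these plots can be computed by hand as in this sketch (an illustration, not the function's actual implementation):

z <- samp[, , xmin]                      # iterations x chains
rk <- matrix(rank(z), nrow(z), ncol(z))  # joint ranks across all chains
# under good mixing, the ranks within each chain are close to uniform
hist(rk[, 1], breaks = 20, main = "ranks of chain 1")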

2.1.2 Alternative parameterization of Cauchy

Next, we examine an alternative parameterization of the Cauchy as a scale mixture of Gaussians. The model has two parameter vectors, and the Cauchy-distributed \(x\) can be computed deterministically from those. In addition to improved sampling performance, the example illustrates that it matters which quantities we focus our diagnostics on.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
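
The scale-mixture identity behind this model can be checked by direct simulation: if x_a ~ normal(0, 1) and x_b ~ gamma(0.5, 0.5), so that x_b is chi-squared with one degree of freedom, then x_a / sqrt(x_b) is standard Cauchy. A quick check (an illustration, not part of the Stan model):

set.seed(7878)
x_mc <- rnorm(1e5) / sqrt(rgamma(1e5, shape = 0.5, rate = 0.5))
round(quantile(x_mc, c(0.25, 0.5, 0.75)), 2)  # close to the Cauchy quartiles -1, 0, 1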

Run the alternative model:

fit_alt1 <- stan(file = 'cauchy_alt_1.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster.

mon <- monitor(fit_alt1)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

            Q5    Q50    Q95   Mean     SD  Rhat Bulk_ESS Tail_ESS
x_a[1]   -1.68  -0.03   1.60  -0.02   1.00     1     4259     3066
x_a[2]   -1.59  -0.01   1.71   0.02   1.00     1     3649     3147
x_a[3]   -1.62   0.00   1.61   0.01   1.00     1     4088     2888
x_a[4]   -1.66  -0.02   1.58  -0.02   0.99     1     4144     3084
x_a[5]   -1.69  -0.01   1.72  -0.01   1.02     1     4573     2516
x_a[6]   -1.67   0.01   1.71   0.00   1.02     1     4279     2993
x_a[7]   -1.69   0.03   1.71   0.02   1.01     1     4071     2908
x_a[8]   -1.61   0.01   1.64   0.01   0.99     1     3999     2887
x_a[9]   -1.71   0.02   1.64   0.01   1.00     1     3866     2988
x_a[10]  -1.63  -0.01   1.60  -0.01   1.00     1     3931     2747
x_a[11]  -1.60  -0.04   1.61  -0.01   0.99     1     4216     3576
x_a[12]  -1.64   0.01   1.64   0.00   1.00     1     3673     2687
x_a[13]  -1.70  -0.01   1.68  -0.01   1.02     1     3886     2906
x_a[14]  -1.69   0.03   1.66   0.02   1.00     1     3959     2727
x_a[15]  -1.63  -0.01   1.67   0.00   1.00     1     3742     3186
x_a[16]  -1.65  -0.01   1.65   0.00   1.00     1     4343     3249
x_a[17]  -1.66  -0.02   1.67   0.00   1.01     1     4346     2879
x_a[18]  -1.56   0.01   1.55   0.00   0.95     1     3918     2634
x_a[19]  -1.61  -0.02   1.65  -0.02   0.99     1     4047     2884
x_a[20]  -1.64   0.01   1.72   0.01   1.01     1     4213     2879
x_a[21]  -1.62  -0.01   1.58  -0.01   0.98     1     4314     2547
x_a[22]  -1.61   0.00   1.61  -0.01   0.98     1     3873     3010
x_a[23]  -1.69  -0.02   1.62  -0.03   1.00     1     3857     3066
x_a[24]  -1.61  -0.02   1.61  -0.01   0.98     1     3458     3012
x_a[25]  -1.65   0.01   1.64   0.00   1.01     1     4311     2983
x_a[26]  -1.60   0.05   1.64   0.03   0.98     1     3960     3114
x_a[27]  -1.64  -0.02   1.66  -0.02   1.01     1     3585     2797
x_a[28]  -1.62   0.01   1.59   0.01   0.99     1     3899     2890
x_a[29]  -1.60   0.02   1.58   0.02   0.97     1     4385     3172
x_a[30]  -1.65   0.02   1.70   0.02   1.01     1     4215     3111
x_a[31]  -1.64   0.00   1.64   0.00   0.98     1     3863     2660
x_a[32]  -1.67  -0.01   1.65   0.00   1.00     1     4340     2691
x_a[33]  -1.65  -0.02   1.59  -0.01   0.99     1     4232     2779
x_a[34]  -1.67   0.02   1.69   0.01   1.00     1     4170     3165
x_a[35]  -1.62   0.00   1.58  -0.01   0.98     1     3703     2630
x_a[36]  -1.68   0.01   1.65   0.00   0.99     1     3897     2907
x_a[37]  -1.60   0.03   1.66   0.02   0.99     1     4487     2498
x_a[38]  -1.69   0.01   1.72   0.01   1.03     1     4029     2708
x_a[39]  -1.64  -0.01   1.66  -0.01   1.01     1     4406     2788
x_a[40]  -1.69  -0.02   1.60  -0.03   1.01     1     4145     2931
x_a[41]  -1.64   0.02   1.65   0.01   1.00     1     4191     3140
x_a[42]  -1.65  -0.02   1.61  -0.03   0.99     1     3664     3171
x_a[43]  -1.62  -0.02   1.64  -0.01   1.00     1     3792     2289
x_a[44]  -1.65   0.02   1.76   0.03   1.03     1     3759     3034
x_a[45]  -1.69   0.04   1.71   0.02   1.03     1     3856     2971
x_a[46]  -1.67  -0.01   1.70   0.01   1.02     1     4204     3087
x_a[47]  -1.62  -0.03   1.65  -0.01   1.01     1     3883     3010
x_a[48]  -1.64   0.01   1.67   0.01   1.01     1     4372     3305
x_a[49]  -1.64   0.02   1.64   0.01   0.99     1     3826     3136
x_a[50]  -1.65  -0.01   1.63  -0.01   1.01     1     3732     2689
x_b[1]    0.00   0.41   3.83   1.00   1.43     1     2375     1407
x_b[2]    0.00   0.47   3.73   1.00   1.39     1     2246     1422
x_b[3]    0.00   0.47   3.76   1.00   1.38     1     2845     1858
x_b[4]    0.01   0.49   3.95   1.02   1.37     1     2758     1600
x_b[5]    0.00   0.49   3.89   1.04   1.48     1     3379     1878
x_b[6]    0.00   0.46   3.81   1.00   1.38     1     2596     1388
x_b[7]    0.01   0.47   4.07   1.02   1.44     1     2834     1440
x_b[8]    0.01   0.45   3.75   1.01   1.42     1     2565     1184
x_b[9]    0.00   0.44   3.83   1.00   1.42     1     2789     1268
x_b[10]   0.00   0.48   3.78   1.01   1.40     1     2694     1390
x_b[11]   0.00   0.45   3.87   0.99   1.36     1     2811     1684
x_b[12]   0.00   0.42   3.94   1.00   1.44     1     2552     1378
x_b[13]   0.00   0.45   3.94   1.02   1.44     1     2507     1292
x_b[14]   0.00   0.44   3.71   0.97   1.39     1     2965     1636
x_b[15]   0.00   0.43   3.98   1.02   1.45     1     2517     1372
x_b[16]   0.01   0.49   3.62   1.00   1.38     1     2401     1410
x_b[17]   0.00   0.47   3.86   1.01   1.39     1     2429     1458
x_b[18]   0.00   0.46   3.79   0.98   1.38     1     2179     1356
x_b[19]   0.00   0.44   3.81   0.99   1.39     1     2983     1730
x_b[20]   0.00   0.48   3.61   0.97   1.33     1     3158     1685
x_b[21]   0.00   0.47   3.74   0.98   1.35     1     2894     1499
x_b[22]   0.00   0.45   3.85   1.02   1.48     1     2445     1466
x_b[23]   0.00   0.46   4.02   1.02   1.44     1     2780     1916
x_b[24]   0.00   0.46   3.79   1.00   1.45     1     2552     1440
x_b[25]   0.00   0.49   3.84   1.01   1.41     1     2514     1293
x_b[26]   0.00   0.44   4.08   1.03   1.51     1     3286     1665
x_b[27]   0.00   0.43   3.75   0.99   1.38     1     2448     1434
x_b[28]   0.00   0.45   3.77   1.00   1.43     1     2635     1291
x_b[29]   0.00   0.42   3.80   0.97   1.40     1     2567     1490
x_b[30]   0.00   0.45   3.82   0.99   1.42     1     2506     1630
x_b[31]   0.00   0.43   3.84   0.99   1.37     1     2814     1814
x_b[32]   0.00   0.45   3.75   1.00   1.41     1     2410     1350
x_b[33]   0.01   0.46   3.81   0.99   1.38     1     2328     1404
x_b[34]   0.00   0.46   3.89   1.01   1.49     1     2840     1490
x_b[35]   0.00   0.41   3.71   0.97   1.42     1     2536     1578
x_b[36]   0.00   0.44   3.60   0.97   1.33     1     2483     1796
x_b[37]   0.00   0.47   3.82   1.01   1.42     1     2794     1641
x_b[38]   0.00   0.47   4.12   1.04   1.49     1     2648     1490
x_b[39]   0.00   0.45   3.96   1.02   1.44     1     2594     1561
x_b[40]   0.00   0.45   3.68   0.99   1.40     1     2461     1480
x_b[41]   0.00   0.48   3.76   1.00   1.38     1     2772     1636
x_b[42]   0.00   0.46   3.95   1.01   1.43     1     2268     1181
x_b[43]   0.00   0.48   3.85   1.02   1.42     1     2522     1264
x_b[44]   0.00   0.46   3.50   0.97   1.32     1     2704     1475
x_b[45]   0.00   0.45   4.01   1.04   1.49     1     2844     1640
x_b[46]   0.00   0.45   3.94   1.00   1.42     1     2485     1656
x_b[47]   0.00   0.49   3.75   1.01   1.40     1     2536     1297
x_b[48]   0.00   0.46   3.79   0.99   1.42     1     2349     1271
x_b[49]   0.00   0.43   3.85   1.00   1.40     1     2499     1447
x_b[50]   0.00   0.46   3.78   1.01   1.44     1     2777     1411
x[1]     -7.64  -0.04   7.33  -0.69  32.90     1     3894     2155
x[2]     -6.02  -0.01   6.41   4.16 362.61     1     3510     1729
x[3]     -6.54   0.00   6.33   0.19  33.51     1     3497     2545
x[4]     -5.52  -0.02   5.74  -3.14 191.65     1     3773     2312
x[5]     -6.20  -0.01   6.29  -0.47  21.58     1     4131     2629
x[6]     -5.55   0.01   6.40   0.19  91.87     1     4172     2474
x[7]     -5.73   0.05   6.63  -0.03  14.48     1     3750     2165
x[8]     -5.80   0.01   5.58  -1.23  53.41     1     3656     2399
x[9]     -5.96   0.03   6.01  -6.47 228.29     1     3568     2182
x[10]    -5.73  -0.01   6.11  -1.49  88.67     1     3612     1969
x[11]    -6.35  -0.05   6.08   0.98  48.99     1     3932     2517
x[12]    -6.59   0.02   5.94  -0.79  30.31     1     3392     2203
x[13]    -6.36  -0.01   6.24  -6.55 221.32     1     3898     2279
x[14]    -6.64   0.04   6.51   3.57 149.72     1     3697     2531
x[15]    -7.28  -0.01   5.90  -0.32  18.93     1     3696     2034
x[16]    -5.65  -0.01   5.28   0.37  32.30     1     3911     2341
x[17]    -6.10  -0.03   6.17  -0.33  26.14     1     4000     2160
x[18]    -5.74   0.02   6.79   3.13  94.50     1     3862     2078
x[19]    -6.01  -0.03   6.04   0.35  18.39     1     3761     2329
x[20]    -5.56   0.01   6.22  -0.07  30.58     1     3876     2338
x[21]    -6.21  -0.01   6.22   0.33  35.50     1     3907     2388
x[22]    -6.23   0.00   6.02  -8.18 356.87     1     3683     1769
x[23]    -6.17  -0.02   6.09  -0.37  14.57     1     3858     2628
x[24]    -6.25  -0.02   5.56  -0.61  81.96     1     3134     2012
x[25]    -6.72   0.02   5.47   1.85 202.62     1     3898     2318
x[26]    -5.61   0.06   5.81  -0.34  34.70     1     3874     2846
x[27]    -7.91  -0.03   6.89   6.55 323.96     1     3694     2564
x[28]    -6.29   0.01   6.80   0.35  31.31     1     3827     2203
x[29]    -6.57   0.02   6.30  -0.41  23.66     1     3796     2357
x[30]    -5.92   0.03   6.31  -0.31  38.53     1     3879     2158
x[31]    -5.80   0.00   6.28   0.36  35.66     1     3701     2565
x[32]    -6.10  -0.02   6.39  -3.81 298.91     1     4374     2445
x[33]    -6.00  -0.03   5.39   1.22  81.47     1     3641     2292
x[34]    -5.87   0.04   6.56   0.09  30.46     1     3855     2568
x[35]    -6.81   0.00   6.06  -0.07  27.08     1     3461     2257
x[36]    -6.12   0.01   6.09  -0.81  40.91     1     3737     2638
x[37]    -5.92   0.03   6.50  -6.16 263.64     1     4237     2370
x[38]    -6.46   0.01   6.00   0.60  58.15     1     3707     2083
x[39]    -6.79  -0.01   6.27  -1.86  64.47     1     4010     2317
x[40]    -6.97  -0.02   5.62   0.25 146.44     1     3776     2196
x[41]    -6.12   0.02   5.98  -0.88  67.65     1     3948     2553
x[42]    -7.48  -0.03   6.42 -18.09 643.89     1     3585     2152
x[43]    -5.99  -0.02   6.57  -0.40  52.47     1     3489     2329
x[44]    -6.09   0.03   5.93   0.44  26.31     1     3772     2408
x[45]    -6.25   0.04   6.65  -0.18  15.75     1     3732     2346
x[46]    -5.88  -0.02   7.59   0.75  28.36     1     3577     2171
x[47]    -6.18  -0.03   5.82   0.24  79.45     1     3776     2192
x[48]    -6.61   0.01   6.03  -3.56 129.17     1     3800     2090
x[49]    -5.88   0.02   6.52  -0.30  46.02     1     3776     2549
x[50]    -7.08   0.00   6.32  -0.71  45.99     1     3488     2383
I         0.00   0.00   1.00   0.49   0.50     1     2906     4000
lp__    -95.29 -81.24 -69.52 -81.62   7.89     1     1518     2678

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
which_min_ess <- which.min(mon[101:150, 'Tail_ESS'])

For all parameters, Rhat is less than 1.01 and ESS exceeds 400, indicating that sampling worked much better with this alternative parameterization. Appendix B has more results using other parameterizations of the Cauchy distribution. The vectors x_a and x_b used to form the Cauchy-distributed x have stable quantile, mean, and sd estimates. The quantiles of each x_j are stable too, but the mean and variance estimates vary widely.
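
The instability of the mean and variance estimates is expected: for i.i.d. Cauchy draws the running mean never stabilizes, as a small simulation illustrates (an aside, not part of the original analysis):

set.seed(1)
x_c <- rcauchy(1e5)
running_mean <- cumsum(x_c) / seq_along(x_c)
range(running_mean[1e4:1e5])  # typically still wanders far from the median 0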

We can further analyze potential problems using local efficiency estimates and rank plots. For this example, we take a detailed look at x[2], which had the smallest tail-ESS of 1729.

We examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates.

plot_local_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 20)

The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 40)

The rank plots also look close to uniform across chains, which is consistent with good mixing.

samp <- as.array(fit_alt1)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

In summary, the alternative parameterization produces results that look much better than for the nominal parameterization. There are still some differences in the tails, which we also identified via the tail-ESS.

2.1.3 Half-Cauchy with nominal parameterization

Half-Cauchy priors for non-negative parameters are common and, at least in Stan, usually specified via the nominal parameterization. In this example, we place independent half-Cauchy distributions on each element of the 50-dimensional vector \(x\), constrained to be positive (in Stan, <lower=0>). Stan then automatically samples in the unconstrained log(x) space, which crucially changes the geometry.
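
For intuition (a standard change-of-variables argument, stated here for illustration): with \(y = \log(x)\), the density in the unconstrained space is \(p(y) = p_x(e^y)\,e^y \propto e^y/(1 + e^{2y})\), which decays exponentially in both directions. The heavy polynomial tail of the half-Cauchy therefore becomes much easier to explore in the space Stan actually samples in.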

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the half-Cauchy with nominal parameterization and positive constraint:

fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the Cauchy nominal model without the constraint.

mon <- monitor(fit_half_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

          Q5    Q50    Q95   Mean      SD  Rhat Bulk_ESS Tail_ESS
x[1]    0.08   1.01  14.11   7.41  128.72  1.00     6707     1805
x[2]    0.08   0.99  13.81   9.58  159.58  1.00     5993     1755
x[3]    0.07   0.97  13.53   4.72   30.32  1.00     7473     2288
x[4]    0.09   0.99  11.20   4.99   43.10  1.00     7936     2114
x[5]    0.07   1.01  14.51   5.05   47.10  1.00     7577     2314
x[6]    0.08   1.02  13.91  91.28 5383.88  1.00     7497     2070
x[7]    0.08   0.99  13.46   6.57   93.25  1.00     7717     2008
x[8]    0.09   1.00  11.29   4.84   34.92  1.00     9360     2095
x[9]    0.07   0.99  13.40  10.80  162.43  1.00     7536     1954
x[10]   0.08   1.02  12.53   5.61   49.28  1.00     7460     2243
x[11]   0.09   1.00  10.67   6.78  116.46  1.01     7359     2428
x[12]   0.08   1.01  12.34   5.46   47.09  1.00     8042     2038
x[13]   0.08   1.01  11.84   4.77   62.16  1.00     7179     2074
x[14]   0.08   1.01  11.97   4.99   65.23  1.00     8496     2440
x[15]   0.08   0.98  13.84   6.31   61.42  1.00     8156     2053
x[16]   0.08   0.98  11.59   6.25   67.00  1.00     7641     2345
x[17]   0.08   0.97  11.95   3.92   19.12  1.00     6994     2276
x[18]   0.08   1.00  10.98   4.05   23.61  1.00     7002     2297
x[19]   0.08   1.03  12.08  38.38 2005.23  1.00     7617     2394
x[20]   0.07   1.00  14.24   5.68   39.23  1.00     8368     2058
x[21]   0.08   1.01  11.88   6.11  110.22  1.00     8168     2795
x[22]   0.09   1.02  11.18   7.01  169.77  1.00     7621     1998
x[23]   0.07   1.00  14.22   7.35   94.96  1.00     7423     2099
x[24]   0.07   0.99  12.30   6.63  130.07  1.00     7156     2010
x[25]   0.09   1.01  11.29   9.22  343.88  1.00     7965     2303
x[26]   0.07   1.05  13.90   7.72  130.68  1.00     7080     2412
x[27]   0.07   0.99  14.12   5.75   52.77  1.00     8388     1981
x[28]   0.07   1.00  15.56   7.49   73.53  1.00     8576     2074
x[29]   0.08   1.00  13.28   5.13   37.83  1.00     7558     2029
x[30]   0.06   0.99  14.14   7.69  139.13  1.00     6453     2051
x[31]   0.08   1.04  16.07   6.73   50.89  1.00     7135     2140
x[32]   0.09   1.01  13.23   5.32   43.21  1.00     7652     2525
x[33]   0.08   1.00  12.96   6.26   92.50  1.00     7324     2177
x[34]   0.09   0.97  11.01   3.78   23.01  1.00     7435     2393
x[35]   0.07   1.00  13.56   5.08   30.58  1.00     6841     2457
x[36]   0.07   1.01  13.18   6.52   80.50  1.00     7537     1798
x[37]   0.09   1.03  11.61   4.16   23.07  1.00     7235     2361
x[38]   0.08   0.99  15.73   5.35   58.79  1.00     7507     1876
x[39]   0.07   1.00  13.93  31.05  827.68  1.00     8020     1704
x[40]   0.07   1.03  12.88  10.80  173.36  1.00     7307     1770
x[41]   0.07   0.97  13.42   7.83   96.96  1.00     7211     1962
x[42]   0.09   0.99  11.94   8.61  152.58  1.00     7535     2438
x[43]   0.09   1.01  12.12   4.59   52.98  1.00     7387     2445
x[44]   0.08   1.02  12.66   4.62   34.61  1.00     6434     2010
x[45]   0.07   1.00  13.25  20.04  914.15  1.00     6968     2098
x[46]   0.07   1.03  14.46   4.87   27.22  1.00     7827     2130
x[47]   0.08   1.00  13.13   6.86   99.30  1.00     6559     2707
x[48]   0.07   1.02  12.49   6.80   87.98  1.00     7787     2588
x[49]   0.08   1.00  12.68   5.54   37.20  1.00     7918     2135
x[50]   0.10   0.99  11.95   8.59  155.18  1.00     8065     1714
I       0.00   0.00   1.00   0.50    0.50  1.00     5807     4000
lp__  -80.63 -69.45 -59.66 -69.65    6.45  1.00     1143     2006

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

All values of Rhat are less than 1.01 and ESS exceeds 400 for all parameters, indicating good performance of the sampler despite using the nominal parameterization of the Cauchy distribution. More experiments with the half-Cauchy distribution can be found in Appendix B.

2.2 Hierarchical model: Eight Schools

This section relates to the examples presented in Section 5.2 of the paper.

The Eight Schools data is a classic example for hierarchical models (see Section 5.5 in Gelman et al., 2013) which, despite its simplicity, illustrates the typical inference problems of such models. The Stan models are adapted from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences. Appendix C contains a more detailed analysis of different algorithm variants.

2.2.1 A Centered Eight Schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

2.2.1.1 Centered Eight Schools model

We directly run the centered parameterization model with an increased adapt_delta value to reduce the probability of getting divergent transitions.

eight_schools <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 28 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems

Still, we observe a lot of divergent transitions, which in itself is already a sufficient indicator of convergence problems. We can additionally use the Rhat and ESS diagnostics to recognize problematic parts of the posterior. The latter two have the advantage over the divergent transitions diagnostic that they can be used with all MCMC algorithms, not only with HMC.

mon <- monitor(fit_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

             Q5    Q50   Q95   Mean   SD  Rhat Bulk_ESS Tail_ESS
mu        -0.93   4.53  9.85   4.50 3.26  1.01      696     1123
tau        0.71   3.02  9.79   3.86 3.12  1.02      219      232
theta[1]  -1.24   5.88 16.01   6.41 5.52  1.01      991     1593
theta[2]  -2.44   4.99 12.84   5.05 4.73  1.00     1250     2037
theta[3]  -4.99   4.32 12.13   4.06 5.36  1.00     1168     1790
theta[4]  -2.91   4.88 12.73   4.89 4.77  1.00     1030     1908
theta[5]  -3.99   3.95 10.80   3.75 4.59  1.00      956     2015
theta[6]  -3.97   4.38 11.45   4.16 4.71  1.01     1197     1704
theta[7]  -0.91   6.04 15.06   6.44 4.99  1.00      936     1587
theta[8]  -3.08   4.86 13.78   4.99 5.20  1.00     1290     1496
lp__     -24.50 -15.31 -4.61 -15.03 6.07  1.02      210      242

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

See Appendix C for results of longer chains.

Bulk-ESS and Tail-ESS for the between-school standard deviation tau are 219 and 232, respectively. Both are much less than 400, indicating that we should investigate this parameter more carefully. We thus examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small intervals of tau. These plots may show either quantiles or parameter values on the vertical axis. Red ticks show divergent transitions.

plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20)

plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20, rank = FALSE)

We see that the sampler has difficulties in exploring small tau values. As the sampling efficiency for estimating small tau values is practically zero, we may assume that we are missing a substantial amount of posterior mass and thus get biased estimates. The red ticks, which show iterations with divergences, are concentrated at small tau values, giving us another indication of problems in exploring small values.

We also examine the sampling efficiency of different quantile estimates. Again, these plots may show either quantiles or parameter values on the vertical axis.

plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40)

plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40, rank = FALSE)

Most of the quantile estimates have worryingly low effective sample size.

Next we examine how the estimated effective sample size changes when we use more and more draws. Here we do not see sudden changes, but both bulk-ESS and tail-ESS are consistently low. See Appendix C for results of longer chains.

plot_change_ess(fit = fit_cp, par = "tau")

In line with the other findings, rank plots of tau clearly show problems in the mixing of the chains. In particular, the rank plot for the first chain indicates that it was unable to explore the lower end of the posterior range, while the spike in the rank plot for chain 2 indicates that it spent too much time stuck at these values.

samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])

2.2.2 Non-centered Eight Schools model

For hierarchical models, the non-centered parameterization often works better than the centered one. Instead of drawing theta directly from normal(mu, tau), it samples standardized deviations theta_tilde ~ normal(0, 1) and computes theta = mu + tau * theta_tilde, so that the sampled parameters are a priori independent:

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

For reasons of comparability, we also run the non-centered parameterization model with an increased adapt_delta value:

fit_ncp2 <- stan(
  file = 'eight_schools_ncp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)

We get zero divergences and no other warnings, which is a first good sign.

mon <- monitor(fit_ncp2)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                   Q5   Q50   Q95  Mean   SD  Rhat Bulk_ESS Tail_ESS
mu              -1.00  4.41  9.80  4.43 3.29     1     5203     3092
tau              0.24  2.73  9.46  3.56 3.17     1     2671     1965
theta_tilde[1]  -1.35  0.30  1.96  0.31 1.00     1     5593     2533
theta_tilde[2]  -1.45  0.11  1.63  0.10 0.92     1     5610     2955
theta_tilde[3]  -1.70 -0.09  1.60 -0.08 1.00     1     5603     2861
theta_tilde[4]  -1.44  0.09  1.63  0.08 0.93     1     5817     3245
theta_tilde[5]  -1.66 -0.17  1.39 -0.16 0.92     1     5023     3164
theta_tilde[6]  -1.60 -0.07  1.54 -0.06 0.96     1     5053     2864
theta_tilde[7]  -1.31  0.34  1.88  0.33 0.98     1     4563     2753
theta_tilde[8]  -1.50  0.07  1.66  0.08 0.96     1     5638     2989
theta[1]        -1.78  5.66 16.24  6.19 5.70     1     4331     3160
theta[2]        -2.33  4.83 12.35  4.89 4.54     1     5319     3417
theta[3]        -5.07  4.17 11.92  3.99 5.33     1     4355     3210
theta[4]        -2.77  4.76 12.52  4.85 4.76     1     5691     3158
theta[5]        -4.41  3.85 10.59  3.64 4.61     1     4824     3415
theta[6]        -3.82  4.28 11.45  4.07 4.80     1     4866     2814
theta[7]        -1.18  5.86 15.20  6.26 5.01     1     4745     3569
theta[8]        -3.36  4.78 12.93  4.89 5.15     1     5099     3334
lp__           -11.20 -6.67 -3.76 -6.97 2.35     1     1487     2252

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

All Rhat values below 1.01 and all ESS values above 400 indicate a much better performance of the non-centered parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates for tau.

plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 20)

Small tau values are still more difficult to explore, but the relative efficiency is good. We may also check this with a finer resolution:

plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 50)

The sampling efficiency for different quantile estimates looks good as well.

plot_quantile_ess(fit = fit_ncp2, par = 2, nalpha = 40)

The rank plots of tau show no substantial differences between chains.

samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])

References

Betancourt, M. (2017) ‘A conceptual introduction to Hamiltonian Monte Carlo’, arXiv preprint arXiv:1701.02434.

Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.

Gelman, A. et al. (2013) Bayesian data analysis, third edition. CRC Press.

Appendices

The following abbreviations are used throughout the appendices:

  • N = total number of draws
  • Rhat = classic no-split-Rhat
  • sRhat = classic split-Rhat
  • zsRhat = rank-normalized split-Rhat
    • all chains are jointly ranked and z-transformed (see the code sketch after this list)
    • can detect differences in location and trends
  • zfsRhat = rank-normalized folded split-Rhat
    • all chains are jointly “folded” by computing absolute deviations from the median, then ranked and z-transformed
    • can detect differences in scales
  • seff = no-split effective sample size
  • reff = seff / N
  • zsseff = rank-normalized split effective sample size
    • estimates the efficiency of mean estimate for rank normalized values
  • zsreff = zsseff / N
  • zfsseff = rank-normalized folded split effective sample size
    • estimates the efficiency of rank normalized mean absolute deviation
  • zfsreff = zfsseff / N
  • tailseff = minimum of rank-normalized split effective sample sizes of the 5% and 95% quantiles
  • tailreff = tailseff / N
  • medsseff = median split effective sample size
    • estimates the efficiency of the median
  • medsreff = medsseff / N
  • madsseff = mad split effective sample size
    • estimates the efficiency of the median absolute deviation
  • madsreff = madsseff / N
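
The rank normalization and folding steps referred to in the list can be sketched as follows (a minimal sketch using the fractional-rank formula from the paper; monitornew.R contains the actual implementation):

z_scale_sketch <- function(sims) {
  # joint average ranks across all chains and iterations
  r <- rank(sims, ties.method = "average")
  # normal scores via Blom-type fractional offsets, as in the paper
  z <- qnorm((r - 3/8) / (length(sims) + 1/4))
  array(z, dim = dim(sims))
}
# folding (for the zfs* diagnostics): absolute deviations from the median
fold_sketch <- function(sims) abs(sims - median(sims))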

Appendix A: Normal distributions with additional trend, shift or scaling

This part focuses on diagnostics for

  • all chains having a trend and a similar marginal distribution
  • one of the chains having a different mean
  • one of the chains having a lower marginal variance

To simplify, independent draws are used in this part as a proxy for very efficient MCMC sampling. First, we sample draws from a standard normal distribution; we will discuss the behavior for non-normal distributions later. See the list of abbreviations at the beginning of the appendices.

Adding the same trend to all chains

All chains are drawn from the same Normal(0, 1) distribution, with the same linear trend added to each chain:

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  trend = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  trend <- conds[i, "trend"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r <- r + seq(-trend, trend, length.out = iters)
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)

If we don’t split chains, Rhat misses the trends if all chains still have a similar marginal distribution.

ggplot(data = res, aes(y = Rhat, x = trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Rhat without splitting chains')

Split-Rhat can detect trends, even if the marginals of the chains are similar.

ggplot(data = res, aes(y = zsRhat, x = trend)) + 
  geom_point() + geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is useful for detecting non-stationarity (i.e., trends) in the chains. If we use a threshold of \(1.01\), we can detect trends which account for 2% or more of the total marginal variance. If we use a threshold of \(1.1\), we detect trends which account for 30% or more of the total marginal variance.

The effective sample size is based on split-Rhat and within-chain autocorrelation. We plot the relative efficiency \(R_{\rm eff}=S_{\rm eff}/S\) for easier comparison between different values of \(S\). In the plot below, dashed lines indicate the threshold at which we would consider the effective sample size to be sufficient (i.e., \(S_{\rm eff} > 400\)). Since we plot the relative efficiency instead of the effective sample size itself, this threshold is divided by \(S\), which we compute here as the number of iterations per chain (variable iters) times the number of chains (\(4\)).

ggplot(data = res, aes(y = zsreff, x = trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: Split-Rhat is more sensitive to trends for small sample sizes, but the effective sample size becomes more sensitive for larger sample sizes (as autocorrelations can be estimated more accurately).

Advice: If in doubt, run longer chains for more accurate convergence diagnostics.

Shifting one chain

Next, we investigate the sensitivity for detecting that one of the chains has not converged to the same distribution as the others but instead has a different mean.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  shift = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  shift <- conds[i, "shift"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] + shift
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data = res, aes(y = zsRhat, x = shift)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: If we use a threshold of \(1.01\), we can detect shifts with a magnitude of one third or more of the marginal standard deviation. If we use a threshold of \(1.1\), we detect shifts with a magnitude equal to or larger than the marginal standard deviation.

ggplot(data = res, aes(y = zsreff, x = shift)) + 
  geom_point() +
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: The effective sample size is not as sensitive, but a shift with a magnitude of half the marginal standard deviation or more will lead to very low relative efficiency when the total number of draws increases.

Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and a shift of 0.5.

iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Although Rhat was less than \(1.05\) in this situation, the rank plots clearly show that the first chain behaves differently.

Scaling one chain

Next, we investigate the sensitivity for detecting that one of the chains has not converged to the same distribution as the others but instead has a lower marginal variance.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  scale = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  scale <- conds[i, "scale"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] * scale
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)

We first look at the Rhat estimates:

ggplot(data = res, aes(y = zsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is not able to detect scale differences between chains.

ggplot(data = res, aes(y = zfsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Folded-split-Rhat')

Result: Folded-Split-Rhat focuses on scales and detects scale differences.

Result: If we use a threshold of \(1.01\), we can detect a chain whose standard deviation is less than \(3/4\) of that of the others. If we use a threshold of \(1.1\), we detect a chain whose standard deviation is less than \(1/4\) of that of the others.

Next, we look at the effective sample size estimates:

ggplot(data = res, aes(y = zsreff, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') + 
  ggtitle('Relative Bulk-ESS (zsreff)') + 
  scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))

Result: The bulk effective sample size of the mean does not detect the problem, as it focuses on location differences between chains.

Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.

iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Although the folded Rhat is \(1.06\), the rank plots clearly show that the first chain behaves differently.

Appendix B: More experiments with the Cauchy distribution

The classic split-Rhat is based on calculating within- and between-chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e., infinite), the classic split-Rhat is not well justified. In this section, we use the Cauchy distribution as an example of such a distribution. Even in cases where the mean and variance are finite, the distribution can be far from Gaussian; especially distributions with very long tails cause instability in variance and autocorrelation estimates. To alleviate these problems, we use split-Rhat on rank-normalized draws.

The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution.

Nominal parameterization of Cauchy

We already looked at the nominal Cauchy model with direct parameterization in the main text, but for completeness, we take a closer look using different variants of the diagnostics.

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Default Stan options

Run the nominal model:

fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Treedepth exceedance and the Bayesian fraction of missing information are dynamic-HMC-specific diagnostics (Betancourt, 2017). We get warnings about a very large number of post-warmup transitions that exceeded the maximum treedepth, which is likely due to the very long tails of the Cauchy distribution. All chains also have a low estimated Bayesian fraction of missing information, indicating slow mixing.

Trace plots for the first parameter look wild with occasional large values:

samp <- as.array(fit_nom) 
mcmc_trace(samp[, , 1])

Let’s check Rhat and ESS diagnostics.

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

For one parameter, the Rhat values exceed the classic threshold of 1.1, and depending on the Rhat variant, a few others also exceed the threshold of 1.01; in particular, the rank-normalized split-Rhat has several values over 1.01. Note that the classic split-Rhat is not well defined in this example, because the mean and variance of the Cauchy distribution are not finite.

plot_ess(res) 

Both the classic and the new effective sample size estimates have several very small values, so the overall sample should not be trusted.

Result: Effective sample size is more sensitive than (rank-normalized) split-Rhat to detect problems of slow mixing.

We also check the log posterior value lp__ and find out that the effective sample size is worryingly low.

res <- monitor_extra(samp[, , 51:52]) 
cat('lp__: Bulk-ESS = ', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS =  252 
cat('lp__: Tail-ESS = ', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS =  453 

We can further analyze potential problems using local effective sample size estimates and rank plots. We examine x[49], which has the smallest tail-ESS in this run.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size of small interval probability estimates (see Section 4.3 in the paper). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures the mixing of an indicator function which indicates whether the values are inside the specific small interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.

plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)

We see that the efficiency is worryingly low in the tails, which is caused by slow mixing in the long tails of the Cauchy. Orange ticks show draws that exceeded the maximum treedepth.

An alternative way to examine the effective sample size in different parts of the posterior is to compute the effective sample size for quantiles (see Section 4.3 in the paper). Each interval has a specified proportion of the draws, and the efficiency measures the mixing of an indicator function which indicates whether the values are smaller than or equal to the corresponding quantile.

plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)

We see that the efficiency is worryingly low in the tails, which is caused by slow mixing in the long tails of the Cauchy. Orange ticks show draws that exceeded the maximum treedepth.

We can further analyze potential problems using rank plots, from which we clearly see differences between chains.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Default Stan options + increased maximum treedepth

We can try to improve the performance by increasing max_treedepth to \(20\):

fit_nom_td20 <- stan(
  file = 'cauchy_nom.stan', seed = 7878, 
  refresh = 0, control = list(max_treedepth = 20)
)
Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Trace plots for the first parameter still look wild with occasional large values.

samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)

We check the diagnostics for all \(x\).

plot_rhat(res)

All Rhats are below \(1.1\), but many are over \(1.01\). The classic split-Rhat has more variation than the rank-normalized Rhat (note that the former is not well defined here). The folded rank-normalized Rhat shows that there is still more variation in the scale than in the location between the chains.

plot_ess(res) 

Some classic effective sample sizes are very small. If we did not realize that the variance is infinite, we might try to run longer chains; but in the case of an infinite variance, zero relative efficiency (ESS/\(S\)) is the truth, and longer chains will not help with that. However, other quantities can be well defined, which is why it is useful to also look at the rank-normalized version as a generic transformation that achieves finite mean and variance. The smallest bulk-ESS is less than 1000, which is not that bad. The smallest median-ESS is larger than 2500, that is, we are able to estimate the median efficiently. However, many tail-ESS values are less than 400, indicating problems in estimating the scale of the posterior.

Result: The rank normalized effective sample size is more stable than classic effective sample size, which is not well defined for the Cauchy distribution.

Result: It is useful to look at both bulk- and tail-ESS.

We also check lp__. Although increasing max_treedepth improved the bulk-ESS of x, the efficiency for lp__ did not improve.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 158 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 480 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 20)

It seems that increasing max_treedepth did not much improve the efficiency in the tails. We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 40)

The rank plot visualisation of x[8], which has the smallest tail-ESS among the \(x\), indicates clear convergence problems.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plot visualisation of lp__, which has an effective sample size of 158, doesn’t look so good either.

mcmc_hist_r_scale(samp[, , "lp__"])

Default Stan options + increased maximum treedepth + longer chains

Let’s try running 8 times longer chains.

fit_nom_td20l <- stan(
  file = 'cauchy_nom.stan', seed = 7878, 
  refresh = 0, control = list(max_treedepth = 20), 
  warmup = 1000, iter = 9000
)
Warning: There were 2 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Trace plots for the first parameter still look wild with occasional large values.

samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)

Let’s check the diagnostics for all \(x\).

plot_rhat(res)

All Rhats are below \(1.01\). The classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined in this case).

plot_ess(res) 

Most classic ESS estimates are close to zero; running longer chains just made most of them even smaller.

The smallest bulk-ESS values are around 5000, which is not that bad. The smallest median-ESS values are larger than 25000, that is, we are able to estimate the median efficiently. However, the smallest tail-ESS is 833, indicating problems in estimating the scale of the posterior.

Result: The rank normalized effective sample size is more stable than classic effective sample size even for very long chains.

Result: It is useful to look at both bulk- and tail-ESS.

We also check lp__. Although increasing the number of iterations improved the bulk-ESS of the \(x\), the relative efficiency for lp__ did not change.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1289 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2108 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 20)

Increasing the chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use a finer resolution for the local efficiency estimates.

plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)

The sampling efficiency far in the tails is worryingly low. This was more difficult to see previously with fewer draws from the tails. We see similar problems in the plot of effective sample sizes for quantiles.

plot_quantile_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)

Let’s look at the rank plot visualisation of x[48], which has the smallest tail-ESS among the \(x\).

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Increasing the number of iterations could not remove the mixing problems in the tails. The mixing problem is inherent to the nominal parameterization of the Cauchy distribution.

First alternative parameterization of the Cauchy distribution

Next, we examine an alternative parameterization and consider the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameter vectors, and the Cauchy-distributed \(x\) can be computed deterministically from those. In addition to improved sampling performance, the example illustrates that it matters which quantities we focus our diagnostics on.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
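As a quick sanity check, not part of the original appendix, we can verify the scale-mixture construction with plain Monte Carlo draws: if x_a is standard normal and x_b is Gamma(0.5, 0.5), then x_a / sqrt(x_b) should reproduce the Cauchy(0, 1) quantiles.

set.seed(1)
x_a <- rnorm(1e6)
x_b <- rgamma(1e6, shape = 0.5, rate = 0.5)
x <- x_a / sqrt(x_b)  # should be distributed as Cauchy(0, 1)
round(quantile(x, c(0.1, 0.25, 0.5, 0.75, 0.9)), 2)
round(qcauchy(c(0.1, 0.25, 0.5, 0.75, 0.9)), 2)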

We run the alternative model:

fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)

There are no warnings and the sampling is much faster.

samp <- as.array(fit_alt1)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.

plot_ess(res) 

Result: Rank normalized ESSs have less variation than the classic ones, which are not well defined for the Cauchy distribution.

We check lp__:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1518 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2678 

The relative efficiencies for lp__ are also much better than with the nominal parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 20)

The effective sample size is good in all parts of the posterior. We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 40)

We compare the mean effective sample sizes of the underlying parameters in the new parameterization and of the actual \(x\) we are interested in.

res <- monitor_extra(samp[, , 101:150])
res1 <- monitor_extra(samp[, , 1:50])
res2 <- monitor_extra(samp[, , 51:100])
cat('Mean Bulk-ESS for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x = 3771.6 
cat('Mean Tail-ESS for x =' , round(mean(res[, 'tailseff']), 2), '\n')
Mean Tail-ESS for x = 2310.32 
cat('Mean Bulk-ESS for x_a =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_a = 4032.2 
cat('Mean Bulk-ESS for x_b =' , round(mean(res2[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_b = 2637.48 

Result: We see that the effective sample size of the interesting \(x\) can be different from the effective sample size of the parameters \(x_a\) and \(x_b\) that we used to compute it.

The rank plot visualisation of x[2], which has the smallest tail-ESS of 1729 among the \(x\), looks better than for the nominal parameterization.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

Similarly, the rank plot visualisation of lp__ looks better than for the nominal parameterization.

mcmc_hist_r_scale(samp[, , "lp__"])

Another alternative parameterization of the Cauchy distribution

Another alternative parameterization is obtained by a univariate transformation as shown in the following code (see also the 3rd alternative in Michael Betancourt’s case study).

writeLines(readLines("cauchy_alt_3.stan"))
parameters {
  vector<lower=0, upper=1>[50] x_tilde;
}

transformed parameters {
  vector[50] x = tan(pi() * (x_tilde - 0.5));
}

model {
  // Implicit uniform prior on x_tilde
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
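As a quick check, not part of the original appendix, note that this univariate transformation is exactly the inverse CDF of the Cauchy distribution, so a uniform x_tilde yields a Cauchy-distributed x:

p <- c(0.1, 0.25, 0.5, 0.75, 0.9)
# tan(pi * (p - 0.5)) is the Cauchy(0, 1) quantile function
all.equal(tan(pi * (p - 0.5)), qcauchy(p))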

We run the alternative model:

fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the nominal model.

samp <- as.array(fit_alt3)
res <- monitor_extra(samp[, , 51:100])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

All Rhats except some folded Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.

plot_ess(res) 

Result: Rank normalized relative efficiencies have less variation than classic ones. Some bulk and median relative efficiencies are slightly larger than 1, which is possible for antithetic Markov chains, which have negative correlations at odd lags.
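A toy illustration of this super-efficiency, not part of the original appendix: for a chain with negative lag-one autocorrelation \(\rho_1\), even the crude one-term ESS estimate \(S / (1 + 2\rho_1)\) exceeds the number of draws.

set.seed(1)
z <- as.numeric(arima.sim(list(ar = -0.3), n = 1e4))
rho1 <- acf(z, plot = FALSE)$acf[2]  # negative lag-1 autocorrelation
1e4 / (1 + 2 * rho1)                 # crude ESS estimate, larger than 1e4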

We also take a closer look at the lp__ value:

res <- monitor_extra(samp[, , 101:102])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1336 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2063 

The effective sample sizes for these are also much better than with the nominal parameterization.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 20)

We examine also the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 40)

The effective sample size in the tails is worse than for the first alternative parameterization, although it’s still better than for the nominal parameterization.

We compare the mean effective sample size of the underlying parameter in the new parameterization and of the actual Cauchy-distributed \(x\) we are interested in.

res <- monitor_extra(samp[, , 51:100])
res1 <- monitor_extra(samp[, , 1:50])
cat('Mean bulk-seff for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean bulk-seff for x = 4871.74 
cat('Mean tail-seff for x =' , round(mean(res[, 'zfsseff']), 2), '\n')
Mean tail-seff for x = 1602.66 
cat('Mean bulk-seff for x_tilde =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean bulk-seff for x_tilde = 4871.74 
cat('Mean tail-seff for x_tilde =' , round(mean(res1[, 'zfsseff']), 2), '\n')
Mean tail-seff for x_tilde = 1610.46 

The rank plot visualisation of x[23], which has the smallest tail-ESS of 1923 among the \(x\), shows good efficiency, similar to the results for lp__.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

mcmc_hist_r_scale(samp[, , "lp__"])

Half-Cauchy distribution with nominal parameterization

Half-Cauchy priors are common and, in Stan, are usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically samples in the unconstrained log(x) space, which crucially changes the geometry.

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
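To see why sampling in log(x) space helps, here is a small illustration, not part of the original appendix: the log of half-Cauchy draws follows a bell-shaped (hyperbolic secant) density with all moments finite, so its geometry is much friendlier than the heavy-tailed original scale.

set.seed(1)
x <- abs(rcauchy(1e5))  # half-Cauchy(0, 1) draws
y <- log(x)             # the unconstrained scale Stan samples on
# mean should be close to 0 and sd close to pi / 2
round(c(mean = mean(y), sd = sd(y)), 2)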

We run the half-Cauchy model with nominal parameterization (and positive constraint).

fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)

There are no warnings and the sampling is much faster than for the full Cauchy distribution with nominal parameterization.

samp <- as.array(fit_half_nom)
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res) 

All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the half-Cauchy distribution.

plot_ess(res)  

Result: Rank normalized effective sample sizes have less variation than classic ones. Some bulk and median relative efficiencies are larger than 1, which is possible for antithetic Markov chains, which have negative correlations at odd lags.

Due to the <lower=0> constraint, the sampling was done in the log(x) space, and we can also check the performance in that space.

res <- monitor_extra(log(samp[, , 1:50]))
plot_ess(res) 

\(\log(x)\) is quite close to Gaussian, and thus the classic effective sample size is also close to the rank normalized ESS, which is exactly the same as for the original \(x\), since rank normalization is invariant to monotone transformations.

Result: The rank normalized effective sample size is close to the classic effective sample size for transformations which make the distribution close to Gaussian.
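This invariance is easy to verify directly; a minimal demonstration, not part of the original appendix: for positive draws, x and log(x) have identical ranks, so all rank-based quantities coincide.

x <- abs(rcauchy(10))
# log() preserves the ordering of positive values, so the ranks (and
# hence the rank normalized values) are identical
identical(rank(x), rank(log(x)))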

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 20)

The effective sample size is good overall, with only a small dip in the tails. We can also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 40)

The rank plot visualisation of x[39], which has the smallest tail-ESS of 1704 among \(x\), looks good.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plot visualisation of lp__ reveals some small differences in the scales, but it’s difficult to know whether this small variation from uniform is relevant.

mcmc_hist_r_scale(samp[, , "lp__"])

Alternative parameterization of the half-Cauchy distribution

writeLines(readLines("half_cauchy_alt.stan"))
parameters {
  vector<lower=0>[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a .* sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ inv_gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the half-Cauchy model with the alternative parameterization:

fit_half_reparam <- stan(
  file = 'half_cauchy_alt.stan', seed = 7878, refresh = 0
)

There are no warnings and the sampling is as fast as for the half-Cauchy model with nominal parameterization.

samp <- as.array(fit_half_reparam)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)

plot_ess(res) 

Result: The rank normalized relative efficiencies have less variation than the classic ones, which are not well defined for the Cauchy distribution. Based on bulk-ESS and median-ESS, the efficiency for central quantities is much lower, but based on tail-ESS and MAD-ESS, the efficiency in the tails is slightly better than for the half-Cauchy distribution with nominal parameterization. We also see that a parameterization which is good for the full Cauchy distribution is not necessarily good for the half-Cauchy distribution, as the <lower=0> constraint additionally changes the parameterization.

We also check the lp__ values:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 733 
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1510 

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 20)

We also examine the effective sample size for different quantile estimates.

plot_quantile_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 40)

The effective sample size near zero is much worse than for the half-Cauchy distribution with nominal parameterization.

The rank plot visualisation of x[13], which has the smallest tail-ESS among the \(x\), reveals deviations from uniformity, which is expected with a lower effective sample size.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])

A similar result is obtained when looking at the rank plots of lp__.

mcmc_hist_r_scale(samp[, , "lp__"])

The Cauchy distribution with Jags

So far, we have run all models in Stan, but we also want to investigate whether similar problems arise with probabilistic programming languages that use samplers other than variants of Hamiltonian Monte Carlo. Thus, we fit the Cauchy model also with Jags, which uses a dialect of the BUGS language to specify models. Jags uses a clever mix of Gibbs and Metropolis-Hastings sampling. This kind of sampling does not scale well to high-dimensional posteriors with strongly interdependent parameters, but for the relatively simple models discussed in this case study it should work just fine.

The Jags code for the nominal parameterization of the Cauchy distribution looks as follows:

writeLines(readLines("cauchy_nom.bugs"))
model {
  for (i in 1:50) {
    x[i] ~ dt(0, 1, 1)
  }
}

First, we initialize the Jags model for later reuse.

jags_nom <- jags.model(
  "cauchy_nom.bugs",
  n.chains = 4, n.adapt = 10000
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 0
   Unobserved stochastic nodes: 50
   Total graph size: 52

Initializing model

Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.

samp_jags_nom <- coda.samples(
  jags_nom, variable.names = "x",
  n.iter = 1000
)
samp_jags_nom <- aperm(abind(samp_jags_nom, along = 3), c(1, 3, 2))
dimnames(samp_jags_nom)[[2]] <- paste0("chain:", 1:4)

We summarize the model as follows:

mon <- monitor(samp_jags_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):

         Q5   Q50  Q95   Mean       SD  Rhat Bulk_ESS Tail_ESS
x[1]  -6.75  0.01 6.26  -1.49    56.58     1     4133     3973
x[2]  -6.85 -0.05 6.00 -12.81   773.44     1     4029     3995
x[3]  -6.29 -0.01 6.64  -1.25    76.66     1     3900     3732
x[4]  -6.29 -0.02 5.36  -0.13    33.17     1     3941     3718
x[5]  -5.99 -0.01 6.22  -0.63    45.30     1     3906     3878
x[6]  -6.17 -0.01 6.69  -0.28    63.67     1     3872     3920
x[7]  -6.62 -0.02 5.80   0.14    62.66     1     3857     4056
x[8]  -6.41  0.01 6.09   0.77    60.06     1     3914     3789
x[9]  -6.31 -0.02 6.37   1.37    60.29     1     4040     3931
x[10] -6.95  0.01 5.51  -0.68    42.16     1     4046     3779
x[11] -7.08  0.00 5.53   0.78    74.13     1     4110     4102
x[12] -6.50  0.00 6.27   0.91    55.55     1     3679     3927
x[13] -6.24 -0.02 5.92  -0.45    58.14     1     3859     4056
x[14] -5.93  0.01 5.32  -0.36    56.12     1     4245     3635
x[15] -6.21  0.00 6.57   0.90    31.73     1     3703     3876
x[16] -6.52 -0.02 7.31  -1.21   121.06     1     4085     3974
x[17] -5.63 -0.02 5.70   1.18    61.67     1     3863     3496
x[18] -7.20 -0.04 7.01   0.35    51.19     1     4022     3631
x[19] -6.22  0.01 6.12   1.09    44.41     1     3641     3721
x[20] -6.70  0.01 6.75 104.79  6618.04     1     4160     3974
x[21] -6.43 -0.01 6.40   0.08    35.44     1     3864     3973
x[22] -6.18  0.01 6.17  -0.03    46.63     1     4092     3724
x[23] -5.93  0.01 6.48   0.53    18.67     1     3850     3851
x[24] -6.18  0.06 6.87   0.62    48.11     1     4401     3986
x[25] -6.24 -0.02 6.62  -1.75   113.73     1     4200     4062
x[26] -6.20 -0.03 6.98   0.42    51.54     1     4010     4073
x[27] -5.91  0.01 6.66  -0.44    57.74     1     3779     3894
x[28] -7.19  0.01 5.83   0.27    29.41     1     3882     3973
x[29] -6.07  0.01 6.56 364.03 23017.75     1     4266     4076
x[30] -5.83 -0.01 6.46   1.14    65.12     1     4057     3524
x[31] -5.60  0.06 6.30  -1.90   148.03     1     3816     3870
x[32] -5.56  0.03 6.42  -0.34    46.77     1     4205     4023
x[33] -7.01 -0.01 6.43   0.44    45.81     1     3863     3955
x[34] -6.35 -0.03 5.96   0.31    26.10     1     3870     3890
x[35] -6.46  0.02 6.58  -4.75   237.65     1     3899     3302
x[36] -6.56  0.02 6.85  -1.36    61.17     1     3875     3784
x[37] -6.07  0.01 6.20   6.85   386.90     1     3733     3947
x[38] -6.35  0.00 5.77  -0.46    27.62     1     4223     3957
x[39] -6.21  0.03 6.14   1.79    98.98     1     4111     4059
x[40] -5.76  0.00 6.05  -1.61    81.05     1     4120     3849
x[41] -6.66  0.01 6.15  -0.61    47.32     1     3517     3803
x[42] -6.87  0.01 6.73  -9.89   710.69     1     4074     3719
x[43] -6.30  0.05 6.21  -1.17    55.98     1     4094     3813
x[44] -5.81  0.02 6.22  -0.18    56.95     1     3950     3593
x[45] -6.12  0.02 6.30   2.46   163.80     1     3566     3837
x[46] -6.72 -0.02 5.98  -1.01    27.97     1     4056     4005
x[47] -6.85  0.01 6.29  13.15   901.97     1     4016     3953
x[48] -7.04 -0.02 6.82   0.73    76.61     1     4123     3886
x[49] -7.02 -0.01 5.42   0.32    44.15     1     4044     3587
x[50] -6.91  0.00 5.98   1.34   168.89     1     3966     3887

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
which_min_ess <- which.min(mon[1:50, 'Bulk_ESS'])

The overall results look very promising with Rhats = 1 and ESS values close to the total number of draws of 4000. We take a detailed look at x[41], which has the smallest bulk-ESS of 3517.

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.

plot_local_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 20)

The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.

plot_quantile_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 40)

Rank plots also look rather similar across chains.

xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp_jags_nom[, , xmin])

Result: Jags seems to be able to sample from the nominal parameterization of the Cauchy distribution just fine.

Appendix C: A centered eight schools model with very long chains and thinning

We continue our discussion of hierarchical models on the eight schools data, which we started in the eight schools section. We also analyse the performance of different variants of the diagnostics.

A Centered Eight Schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

In the main text, we observed that the centered parameterization of this hierarchical model did not work well with the default MCMC options of Stan plus increased adapt_delta, and so we directly try to fit the model with longer chains.

Centered parameterization with longer chains

Low efficiency can sometimes be compensated for by running longer chains. Let’s check 10 times longer chains.

fit_cp2 <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 20000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 736 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp2)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

             Q5    Q50   Q95   Mean   SD  Rhat Bulk_ESS Tail_ESS
mu        -0.99   4.43  9.73   4.39 3.29  1.00     4801    10365
tau        0.49   2.95 10.04   3.81 3.22  1.01      831      366
theta[1]  -1.43   5.69 16.40   6.31 5.68  1.00     6734    12098
theta[2]  -2.43   4.85 12.71   4.93 4.72  1.00     8846    16193
theta[3]  -5.02   4.15 11.97   3.90 5.37  1.00     8245    13860
theta[4]  -2.72   4.75 12.65   4.79 4.82  1.00     7965    15684
theta[5]  -4.45   3.83 10.65   3.56 4.68  1.00     7140    14540
theta[6]  -4.17   4.20 11.73   4.04 4.92  1.00     8510    14925
theta[7]  -0.81   5.92 15.58   6.44 5.15  1.00     5894    13517
theta[8]  -3.40   4.80 13.57   4.86 5.43  1.00     8355    14078
lp__     -25.06 -15.17 -2.25 -14.64 6.82  1.01      849      491

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
res <- monitor_extra(fit_cp2)
print(res)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

           mean se_mean   sd     Q5    Q50   Q95  seff reff sseff zseff zsseff zsreff  Rhat sRhat
mu         4.39    0.05 3.29  -0.99   4.43  9.73  4756 0.12  4807  4751   4801   0.12     1  1.00
tau        3.81    0.07 3.22   0.49   2.95 10.04  2037 0.05  2049   813    831   0.02     1  1.00
theta[1]   6.31    0.07 5.68  -1.43   5.69 16.40  7277 0.18  7291  6704   6734   0.17     1  1.00
theta[2]   4.93    0.05 4.72  -2.43   4.85 12.71  9686 0.24  9787  8529   8846   0.22     1  1.00
theta[3]   3.90    0.05 5.37  -5.02   4.15 11.97  9991 0.25 10060  8105   8245   0.21     1  1.00
theta[4]   4.79    0.05 4.82  -2.72   4.75 12.65  9072 0.23  9193  7988   7965   0.20     1  1.00
theta[5]   3.56    0.05 4.68  -4.45   3.83 10.65  8243 0.21  8216  7116   7140   0.18     1  1.00
theta[6]   4.04    0.05 4.92  -4.17   4.20 11.73  9596 0.24  9571  8410   8510   0.21     1  1.00
theta[7]   6.44    0.07 5.15  -0.81   5.92 15.58  6262 0.16  6344  5800   5894   0.15     1  1.00
theta[8]   4.86    0.05 5.43  -3.40   4.80 13.57 10220 0.26 10311  8300   8355   0.21     1  1.00
lp__     -14.64    0.25 6.82 -25.06 -15.17 -2.25   768 0.02   787   826    849   0.02     1  1.01
         zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
mu           1   1.00       1    8926    0.22    10365     0.26     4543     0.11     7934     0.20
tau          1   1.01       1    7094    0.18      366     0.01     2258     0.06     6208     0.16
theta[1]     1   1.00       1    4158    0.10    12098     0.30     5831     0.15     7813     0.20
theta[2]     1   1.00       1    7249    0.18    16193     0.40     6005     0.15     7675     0.19
theta[3]     1   1.00       1    8038    0.20    13860     0.35     4840     0.12     7521     0.19
theta[4]     1   1.00       1    7714    0.19    15684     0.39     4902     0.12     7974     0.20
theta[5]     1   1.00       1    9200    0.23    14540     0.36     4299     0.11     8047     0.20
theta[6]     1   1.00       1    9236    0.23    14925     0.37     5130     0.13     7876     0.20
theta[7]     1   1.00       1    6525    0.16    13517     0.34     5463     0.14     7854     0.20
theta[8]     1   1.00       1    6592    0.16    14078     0.35     5676     0.14     7473     0.19
lp__         1   1.01       1    1489    0.04      491     0.01     2087     0.05     4927     0.12

We still get a large number of divergent transitions, so it’s clear that the results can’t be trusted even if all other diagnostics looked good. Still, it may be worth looking at additional diagnostics to better understand what’s happening.

Some rank-normalized split-Rhats are still larger than \(1.01\). Bulk-ESS for tau and lp__ are around 800, which corresponds to a low relative efficiency of about \(1\%\) but is still above our recommendation of ESS > 400. In such cases, it is useful to also look at the local efficiency estimates (and the large number of divergences is a clear indication of problems, too).

We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small intervals for tau.

plot_local_ess(fit = fit_cp2, par = "tau", nalpha = 50)

We see that the sampling has difficulties in exploring small tau values. As ESS < 400 for small probability intervals at small tau values, we may suspect that we are missing a substantial amount of posterior mass and getting biased estimates.

We also examine the effective sample size of different quantile estimates.

plot_quantile_ess(fit = fit_cp2, par = "tau", nalpha = 100)

Several quantile estimates have ESS < 400, which raises the suspicion that there are convergence problems and that we may have significant bias.

Let’s see how Bulk-ESS and Tail-ESS change when we use more and more draws.

plot_change_ess(fit = fit_cp2, par = "tau")

We see that Bulk-ESS and Tail-ESS, with the recommendation that both exceed 400, are not sufficient to detect the convergence problems in this case, even though the ESS estimates for tail quantiles are able to detect them.

The rank plot visualisation of tau also shows clear sticking and mixing problems.

samp_cp2 <- as.array(fit_cp2)
mcmc_hist_r_scale(samp_cp2[, , "tau"])

Similar results are obtained for lp__, which is closely connected to tau for this model.

mcmc_hist_r_scale(samp_cp2[, , "lp__"])

We may also examine small interval efficiencies for mu.

plot_local_ess(fit = fit_cp2, par = "mu", nalpha = 50)

There are gaps of poor efficiency, which again indicates problems in the mixing of the chains. However, these problems do not occur for any specific range of values of mu, as was the case for tau. This tells us that it’s probably not mu that the sampler has problems with, but more likely tau or a related quantity.

As we observed divergences, we shouldn’t trust any Monte Carlo standard error (MCSE) estimates, as they are likely biased as well. However, for illustration purposes, we compute the MCSE of the median for mu and tau. Compared to the shorter MCMC run, using 10 times more draws has not reduced the MCSE to one third, as would be expected if there were no problems in the mixing of the chains.
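As a back-of-the-envelope check of the expected scaling, not part of the original analysis: for well-mixing chains the MCSE scales as \(1/\sqrt{\mathrm{ESS}}\), so 10 times more draws should multiply the MCSE by a factor of about 0.32, i.e. reduce it to roughly one third.

# Expected MCSE ratio when the effective sample size grows tenfold
round(1 / sqrt(10), 2)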

round(mcse_quantile(samp_cp2[ , , "mu"], prob = 0.5), 2)
[1] 0.06
round(mcse_quantile(samp_cp2[ , , "tau"], prob = 0.5), 2)
[1] 0.07

Centered parameterization with very long chains

For further evidence, let’s check chains 100 times longer than the default. This is not something we would recommend doing in practice, as it does not solve the problems with divergences, as illustrated below.

fit_cp3 <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 200000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 9955 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp3)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):

             Q5    Q50   Q95   Mean   SD  Rhat Bulk_ESS Tail_ESS
mu        -1.11   4.41 10.11   4.45 3.44     1     3928     1375
tau        0.44   2.91 10.03   3.77 3.21     1     1880      426
theta[1]  -1.57   5.78 16.33   6.36 5.70     1    10822    92203
theta[2]  -2.54   4.90 13.21   5.02 4.81     1     8299     1524
theta[3]  -5.00   4.16 12.35   3.97 5.42     1     7603     1503
theta[4]  -2.99   4.73 13.08   4.81 4.92     1     9006     1515
theta[5]  -4.55   3.83 11.14   3.63 4.81     1     6077     1477
theta[6]  -4.19   4.19 12.00   4.07 4.98     1     6986     1478
theta[7]  -1.04   5.96 15.62   6.45 5.20     1     9461    74201
theta[8]  -3.47   4.80 13.58   4.92 5.44     1    10827    78183
lp__     -24.94 -15.07 -2.28 -14.54 6.82     1     5160     2755

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
res <- monitor_extra(fit_cp3)
print(res)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):

           mean se_mean   sd     Q5    Q50   Q95  seff reff sseff zseff zsseff zsreff  Rhat sRhat
mu         4.45    0.06 3.44  -1.11   4.41 10.11  3393 0.01  3372  3953   3928   0.01     1     1
tau        3.77    0.03 3.21   0.44   2.91 10.03 11689 0.03 11680  1944   1880   0.00     1     1
theta[1]   6.36    0.05 5.70  -1.57   5.78 16.33 13138 0.03 12991 10915  10822   0.03     1     1
theta[2]   5.02    0.05 4.81  -2.54   4.90 13.21  8260 0.02  8184  8382   8299   0.02     1     1
theta[3]   3.97    0.06 5.42  -5.00   4.16 12.35  8482 0.02  8387  7660   7603   0.02     1     1
theta[4]   4.81    0.05 4.92  -2.99   4.73 13.08  9245 0.02  9211  9037   9006   0.02     1     1
theta[5]   3.63    0.06 4.81  -4.55   3.83 11.14  6536 0.02  6499  6115   6077   0.02     1     1
theta[6]   4.07    0.06 4.98  -4.19   4.19 12.00  7364 0.02  7301  7050   6986   0.02     1     1
theta[7]   6.45    0.05 5.20  -1.04   5.96 15.62  9899 0.02  9841  9514   9461   0.02     1     1
theta[8]   4.92    0.05 5.44  -3.47   4.80 13.58 12331 0.03 12286 10877  10827   0.03     1     1
lp__     -14.54    0.10 6.82 -24.94 -15.07 -2.28  4944 0.01  4940  5160   5160   0.01     1     1
         zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
mu           1      1       1    3198    0.01     1375     0.00    18404     0.05    14646     0.04
tau          1      1       1   36677    0.09      426     0.00    14021     0.04    15074     0.04
theta[1]     1      1       1   17126    0.04    92203     0.23    17780     0.04    15456     0.04
theta[2]     1      1       1   10430    0.03     1524     0.00    17680     0.04    14336     0.04
theta[3]     1      1       1   11294    0.03     1503     0.00    19051     0.05    13816     0.03
theta[4]     1      1       1   11909    0.03     1515     0.00    18179     0.05    14777     0.04
theta[5]     1      1       1    8425    0.02     1477     0.00    18820     0.05    14003     0.04
theta[6]     1      1       1   10033    0.03     1478     0.00    17844     0.04    13732     0.03
theta[7]     1      1       1   12363    0.03    74201     0.19    17559     0.04    16239     0.04
theta[8]     1      1       1   16783    0.04    78183     0.20    18178     0.05    15473     0.04
lp__         1      1       1    6886    0.02     2755     0.01    13314     0.03    16069     0.04

Rhat, Bulk-ESS and Tail-ESS are not able to detect the problems, although Tail-ESS for tau is suspiciously low compared to the total number of draws.

plot_local_ess(fit = fit_cp3, par = "tau", nalpha = 100)

plot_quantile_ess(fit = fit_cp3, par = "tau", nalpha = 100)

And the rank plots of tau also show sticking and mixing problems for small values of tau.

samp_cp3 <- as.array(fit_cp3)
mcmc_hist_r_scale(samp_cp3[, , "tau"])

What we do see is an advantage of rank plots over trace plots: even with 100000 draws per chain, rank plots don’t get crowded, and the mixing problems of the chains are still easy to see.

With the centered parameterization, the mean estimate of tau tends to get smaller with more draws. With 400000 draws using the centered parameterization, the mean estimate is 3.77 (se 0.03). With 40000 draws using the non-centered parameterization, the mean estimate is 3.6 (se 0.02). The difference is more than 8 sigmas. We are able to see the convergence problems in the centered parameterization case if we look carefully (or use the divergence diagnostics), but Rhat, Bulk-ESS, Tail-ESS and Monte Carlo error estimates for the mean can’t be trusted if other diagnostics indicate convergence problems!

Centered parameterization with very long chains and thinning

When the autocorrelation time is high, it has been common to thin the chains by saving only a small portion of the draws. This throws away information that would also be useful for convergence diagnostics. With 400000 iterations per chain, thinning by 200, and 4 chains, we again end up with 4000 saved iterations, as with the default settings.
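A toy illustration, not part of the original appendix, of why thinning can hide slow mixing: heavily thinned draws from a strongly autocorrelated AR(1) process look almost independent, so autocorrelation-based diagnostics lose the evidence of the slow mixing.

set.seed(1)
n <- 1e5
z <- as.numeric(arima.sim(list(ar = 0.97), n = n))
acf(z, lag.max = 1, plot = FALSE)$acf[2]       # lag-1 autocorrelation: ~0.97
z_thin <- z[seq(1, n, by = 200)]
acf(z_thin, lag.max = 1, plot = FALSE)$acf[2]  # nearly zero after thinning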

fit_cp4 <- stan(
  file = 'eight_schools_cp.stan', data = eight_schools,
  iter = 400000, thin = 200, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 74 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems

We observe several divergent transitions and the estimated Bayesian fraction of missing information is also low, which indicate convergence problems and potentially biased estimates.

Unfortunately, the thinning makes the Rhat and ESS estimates miss the problems. The posterior mean is still biased, being more than 3 sigmas away from the estimate obtained using the non-centered parameterization.

monitor(fit_cp4)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):

             Q5    Q50   Q95   Mean   SD  Rhat Bulk_ESS Tail_ESS
mu        -1.15   4.29  9.84   4.30 3.33     1     3992     3773
tau        0.43   2.93  9.75   3.70 3.12     1     2571     1716
theta[1]  -1.74   5.43 16.44   6.04 5.55     1     4091     3875
theta[2]  -2.25   4.73 13.00   4.97 4.76     1     4065     3647
theta[3]  -4.96   4.10 12.11   3.85 5.34     1     4138     3973
theta[4]  -3.12   4.58 12.54   4.68 4.80     1     4055     3868
theta[5]  -4.51   3.68 10.70   3.47 4.72     1     4064     3513
theta[6]  -4.19   4.15 11.75   3.95 4.94     1     3969     3788
theta[7]  -1.22   5.75 15.44   6.23 5.08     1     3882     3536
theta[8]  -3.91   4.68 13.54   4.80 5.40     1     3973     3813
lp__     -24.91 -15.15 -1.17 -14.29 7.11     1     2674     1763

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
res <- monitor_extra(fit_cp4)
print(res)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):

           mean se_mean   sd     Q5    Q50   Q95 seff reff sseff zseff zsseff zsreff  Rhat sRhat
mu         4.30    0.05 3.33  -1.15   4.29  9.84 3972 0.99  3990  3974   3992   1.00     1     1
tau        3.70    0.05 3.12   0.43   2.93  9.75 3338 0.83  3397  2566   2571   0.64     1     1
theta[1]   6.04    0.09 5.55  -1.74   5.43 16.44 4027 1.01  4036  4081   4091   1.02     1     1
theta[2]   4.97    0.07 4.76  -2.25   4.73 13.00 4043 1.01  4055  4055   4065   1.02     1     1
theta[3]   3.85    0.08 5.34  -4.96   4.10 12.11 4189 1.05  4205  4121   4138   1.03     1     1
theta[4]   4.68    0.08 4.80  -3.12   4.58 12.54 4013 1.00  4048  4020   4055   1.01     1     1
theta[5]   3.47    0.07 4.72  -4.51   3.68 10.70 3990 1.00  4022  4005   4064   1.02     1     1
theta[6]   3.95    0.08 4.94  -4.19   4.15 11.75 3728 0.93  3804  3928   3969   0.99     1     1
theta[7]   6.23    0.08 5.08  -1.22   5.75 15.44 3891 0.97  3899  3874   3882   0.97     1     1
theta[8]   4.80    0.09 5.40  -3.91   4.68 13.54 3987 1.00  4019  3942   3973   0.99     1     1
lp__     -14.29    0.14 7.11 -24.91 -15.15 -1.17 2566 0.64  2569  2666   2674   0.67     1     1
         zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
mu           1      1       1    4029    1.01     3773     0.94     4046     1.01     3762     0.94
tau          1      1       1    3818    0.95     1716     0.43     3754     0.94     4101     1.03
theta[1]     1      1       1    3696    0.92     3875     0.97     4130     1.03     3987     1.00
theta[2]     1      1       1    3937    0.98     3647     0.91     4036     1.01     3583     0.90
theta[3]     1      1       1    4033    1.01     3973     0.99     3655     0.91     3853     0.96
theta[4]     1      1       1    3822    0.96     3868     0.97     3816     0.95     3538     0.88
theta[5]     1      1       1    3696    0.92     3513     0.88     3809     0.95     3491     0.87
theta[6]     1      1       1    3725    0.93     3788     0.95     3880     0.97     3708     0.93
theta[7]     1      1       1    3284    0.82     3536     0.88     3946     0.99     3927     0.98
theta[8]     1      1       1    3717    0.93     3813     0.95     3842     0.96     3765     0.94
lp__         1      1       1    2931    0.73     1763     0.44     3682     0.92     3787     0.95

Various diagnostic plots of tau look reasonable as well.

plot_local_ess(fit = fit_cp4, par = "tau", nalpha = 100)

plot_quantile_ess(fit = fit_cp4, par = "tau", nalpha = 100)

plot_change_ess(fit = fit_cp4, par = "tau")

However, the rank plots still seem to show the problem.

samp_cp4 <- as.array(fit_cp4)
mcmc_hist_r_scale(samp_cp4[, , "tau"])

Non-centered Eight Schools model

In the following, we want to expand our understanding of the non-centered parameterization of the hierarchical model fit to the eight schools data.

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

Non-centered parameterization with default MCMC options

In the main text, we have already seen that the non-centered parameterization works better than the centered parameterization, at least when we use an increased adapt_delta value. Let’s see what happens when using the default MCMC options of Stan.

fit_ncp <- stan(
  file = 'eight_schools_ncp.stan', data = eight_schools,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0
)

We observe a few divergent transitions with the default of adapt_delta=0.8. Let’s analyze the sample.

monitor(fit_ncp)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                   Q5   Q50   Q95  Mean   SD  Rhat Bulk_ESS Tail_ESS
mu              -1.13  4.49  9.74  4.41 3.33  1.00     4643     2367
tau              0.27  2.67  9.66  3.50 3.08  1.00     2564     1952
theta_tilde[1]  -1.38  0.33  1.96  0.31 1.01  1.00     5252     2905
theta_tilde[2]  -1.47  0.11  1.63  0.09 0.94  1.00     4930     3128
theta_tilde[3]  -1.67 -0.09  1.54 -0.07 0.98  1.00     5742     2900
theta_tilde[4]  -1.49  0.03  1.59  0.04 0.94  1.00     4847     3065
theta_tilde[5]  -1.65 -0.17  1.38 -0.16 0.92  1.01     4358     2925
theta_tilde[6]  -1.64 -0.11  1.45 -0.09 0.94  1.00     4043     2535
theta_tilde[7]  -1.25  0.35  1.87  0.33 0.96  1.00     4681     2837
theta_tilde[8]  -1.50  0.08  1.63  0.08 0.98  1.00     5084     2649
theta[1]        -1.54  5.49 15.80  6.12 5.53  1.00     4531     3124
theta[2]        -2.69  4.83 12.60  4.89 4.67  1.00     5351     3070
theta[3]        -4.50  4.18 11.89  4.00 5.15  1.00     4468     2889
theta[4]        -2.82  4.61 12.21  4.69 4.73  1.00     4214     2881
theta[5]        -4.03  3.96 10.70  3.70 4.57  1.00     4659     3184
theta[6]        -4.12  4.17 11.45  3.99 4.75  1.00     5100     3157
theta[7]        -0.83  5.80 15.09  6.26 5.04  1.00     4019     3042
theta[8]        -3.15  4.81 13.49  4.87 5.21  1.00     4596     2945
lp__           -11.33 -6.65 -3.77 -6.97 2.33  1.00     1524     2497

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
res <- monitor_extra(fit_ncp)
print(res)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                mean se_mean   sd     Q5   Q50   Q95 seff reff sseff zseff zsseff zsreff  Rhat
mu              4.41    0.05 3.33  -1.13  4.49  9.74 4540 1.14  4604  4578   4643   1.16     1
tau             3.50    0.06 3.08   0.27  2.67  9.66 2857 0.71  2880  2531   2564   0.64     1
theta_tilde[1]  0.31    0.01 1.01  -1.38  0.33  1.96 5234 1.31  5254  5233   5252   1.31     1
theta_tilde[2]  0.09    0.01 0.94  -1.47  0.11  1.63 4830 1.21  4873  4887   4930   1.23     1
theta_tilde[3] -0.07    0.01 0.98  -1.67 -0.09  1.54 5688 1.42  5755  5674   5742   1.44     1
theta_tilde[4]  0.04    0.01 0.94  -1.49  0.03  1.59 4794 1.20  4846  4794   4847   1.21     1
theta_tilde[5] -0.16    0.01 0.92  -1.65 -0.17  1.38 4287 1.07  4359  4287   4358   1.09     1
theta_tilde[6] -0.09    0.01 0.94  -1.64 -0.11  1.45 3999 1.00  4035  4007   4043   1.01     1
theta_tilde[7]  0.33    0.01 0.96  -1.25  0.35  1.87 4600 1.15  4684  4598   4681   1.17     1
theta_tilde[8]  0.08    0.01 0.98  -1.50  0.08  1.63 5048 1.26  5087  5047   5084   1.27     1
theta[1]        6.12    0.09 5.53  -1.54  5.49 15.80 4204 1.05  4237  4486   4531   1.13     1
theta[2]        4.89    0.07 4.67  -2.69  4.83 12.60 5079 1.27  5098  5331   5351   1.34     1
theta[3]        4.00    0.08 5.15  -4.50  4.18 11.89 4064 1.02  4142  4377   4468   1.12     1
theta[4]        4.69    0.08 4.73  -2.82  4.61 12.21 3852 0.96  3976  4068   4214   1.05     1
theta[5]        3.70    0.07 4.57  -4.03  3.96 10.70 4372 1.09  4472  4557   4659   1.16     1
theta[6]        3.99    0.07 4.75  -4.12  4.17 11.45 4838 1.21  4852  5084   5100   1.27     1
theta[7]        6.26    0.08 5.04  -0.83  5.80 15.09 3924 0.98  3954  3988   4019   1.00     1
theta[8]        4.87    0.08 5.21  -3.15  4.81 13.49 4536 1.13  4535  4578   4596   1.15     1
lp__           -6.97    0.06 2.33 -11.33 -6.65 -3.77 1425 0.36  1464  1482   1524   0.38     1
               sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff
mu                 1     1      1    1.00    2017    0.50     2367     0.59     4859     1.21
tau                1     1      1    1.00    2978    0.74     1952     0.49     3343     0.84
theta_tilde[1]     1     1      1    1.00    1979    0.49     2905     0.73     4884     1.22
theta_tilde[2]     1     1      1    1.00    2042    0.51     3128     0.78     4905     1.23
theta_tilde[3]     1     1      1    1.00    1872    0.47     2900     0.72     5216     1.30
theta_tilde[4]     1     1      1    1.00    2167    0.54     3065     0.77     4659     1.16
theta_tilde[5]     1     1      1    1.01    2068    0.52     2925     0.73     3906     0.98
theta_tilde[6]     1     1      1    1.00    2002    0.50     2535     0.63     4379     1.09
theta_tilde[7]     1     1      1    1.00    1971    0.49     2837     0.71     4776     1.19
theta_tilde[8]     1     1      1    1.00    1815    0.45     2649     0.66     4752     1.19
theta[1]           1     1      1    1.00    2612    0.65     3124     0.78     4468     1.12
theta[2]           1     1      1    1.00    2319    0.58     3070     0.77     4932     1.23
theta[3]           1     1      1    1.00    2475    0.62     2889     0.72     5074     1.27
theta[4]           1     1      1    1.00    2274    0.57     2881     0.72     4860     1.22
theta[5]           1     1      1    1.00    2251    0.56     3184     0.80     4930     1.23
theta[6]           1     1      1    1.00    2240    0.56     3157     0.79     4970     1.24
theta[7]           1     1      1    1.00    2552    0.64     3042     0.76     4545     1.14
theta[8]           1     1      1    1.00    2225    0.56     2945     0.74     4554     1.14
lp__               1     1      1    1.00    2456    0.61     2497     0.62     2108     0.53
               madsseff madsreff
mu                 2144     0.54
tau                2937     0.73
theta_tilde[1]     2293     0.57
theta_tilde[2]     2125     0.53
theta_tilde[3]     2230     0.56
theta_tilde[4]     2464     0.62
theta_tilde[5]     2202     0.55
theta_tilde[6]     2531     0.63
theta_tilde[7]     2524     0.63
theta_tilde[8]     2194     0.55
theta[1]           2695     0.67
theta[2]           2642     0.66
theta[3]           2626     0.66
theta[4]           2550     0.64
theta[5]           2597     0.65
theta[6]           2416     0.60
theta[7]           2829     0.71
theta[8]           2588     0.65
lp__               3014     0.75

All Rhats are close to 1, and ESSs are good despite a few divergent transitions. Small interval and quantile plots of tau reveal some sampling problems for small tau values, but not nearly as strong as for the centered parameterization.

plot_local_ess(fit = fit_ncp, par = "tau", nalpha = 20)

plot_quantile_ess(fit = fit_ncp, par = "tau", nalpha = 40)

Overall, the non-centered parameterization looks good even for the default settings of adapt_delta, and increasing it to 0.95 gets rid of the last remaining problems. This stands in sharp contrast to what we observed for the centered parameterization, where increasing adapt_delta didn’t help at all. Actually, this is something we observe quite often: A suboptimal parameterization can cause problems that are not simply solved by tuning the sampler. Instead, we have to adjust our model to achieve trustworthy inference.

Appendix D: A centered eight schools model fit using a Gibbs sampler

We will also run the centered and non-centered parameterizations of the eight schools model with Jags.

Centered Eight Schools Model

The Jags code for the centered eight schools model looks as follows:

writeLines(readLines("eight_schools_cp.bugs"))
model {
  for (j in 1:J) {
    sigma_prec[j] <- pow(sigma[j], -2)
    theta[j] ~ dnorm(mu, tau_prec)
    y[j] ~ dnorm(theta[j], sigma_prec[j])
  }
  mu ~ dnorm(0, pow(5, -2))
  tau ~ dt(0, pow(5, -2), 1)T(0, )
  tau_prec <- pow(tau, -2)
}

First, we initialize the Jags model for later reuse.

jags_cp <- jags.model(
  "eight_schools_cp.bugs",
  data = eight_schools,
  n.chains = 4, n.adapt = 10000
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 8
   Unobserved stochastic nodes: 10
   Total graph size: 40

Initializing model

Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.

samp_jags_cp <- coda.samples(
  jags_cp, c("theta", "mu", "tau"),
  n.iter = 1000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))

Convergence diagnostics indicate problems in the sampling of mu and tau, but also, to a lesser degree, in all other parameters.

mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):

            Q5  Q50   Q95 Mean   SD  Rhat Bulk_ESS Tail_ESS
mu       -1.34 4.09 10.11 4.27 3.35  1.04      227       58
tau       0.15 3.18 11.28 4.15 3.61  1.09       52       27
theta[1] -1.65 5.59 17.66 6.58 6.14  1.02      310      534
theta[2] -2.73 4.58 13.25 5.02 4.94  1.03      398     1478
theta[3] -6.03 3.75 12.07 3.65 5.54  1.05      429     1253
theta[4] -3.48 4.56 13.24 4.78 5.08  1.03      465     1218
theta[5] -5.14 3.55 11.47 3.39 4.94  1.05      295      142
theta[6] -4.86 3.81 12.03 3.84 5.19  1.04      379      599
theta[7] -0.78 5.98 16.90 6.73 5.43  1.02      324      753
theta[8] -4.23 4.45 14.14 4.78 5.86  1.03      459     1383

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

We also see problems in the sampling of tau using various diagnostic plots.

plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)

plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)

plot_change_ess(samp_jags_cp, par = "tau")

Let’s see what happens if we run 10 times longer chains.

samp_jags_cp <- coda.samples(
  jags_cp, c("theta", "mu", "tau"),
  n.iter = 10000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))

Convergence looks better now, although tau is still not estimated very efficiently.

mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

            Q5  Q50   Q95 Mean   SD  Rhat Bulk_ESS Tail_ESS
mu       -0.93 4.29  9.40 4.31 3.20     1     1360     4002
tau       0.24 2.79 10.69 3.78 3.52     1      704      895
theta[1] -1.35 5.63 16.76 6.29 5.76     1     2018     4571
theta[2] -2.27 4.74 12.57 4.89 4.66     1     2560     8874
theta[3] -5.15 4.02 11.66 3.82 5.36     1     2699     7249
theta[4] -2.74 4.58 12.39 4.70 4.76     1     2555     9828
theta[5] -4.58 3.75 10.44 3.52 4.65     1     2304     8797
theta[6] -4.12 4.09 11.24 3.95 4.82     1     2576     8656
theta[7] -0.84 5.80 15.74 6.33 5.16     1     1868     4349
theta[8] -3.38 4.62 13.56 4.82 5.43     1     2866     7935

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

The diagnostic plots of quantiles and small intervals tell a similar story.

plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)

plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)

Notably, however, the increase in effective sample size of tau is linear in the total number of draws, indicating that convergence for tau may be achieved by simply running longer chains.

plot_change_ess(samp_jags_cp, par = "tau")

Result: Similar to Stan, Jags also has convergence problems with the centered parameterization of the eight schools model.

Non-Centered Eight Schools Model

The Jags code for the non-centered eight schools model looks as follows:

writeLines(readLines("eight_schools_ncp.bugs"))
model {
  for (j in 1:J) {
    sigma_prec[j] <- pow(sigma[j], -2)
    theta_tilde[j] ~ dnorm(0, 1)
    theta[j] = mu + tau * theta_tilde[j]
    y[j] ~ dnorm(theta[j], sigma_prec[j])
  }
  mu ~ dnorm(0, pow(5, -2))
  tau ~ dt(0, pow(5, -2), 1)T(0, )
}

First, we initialize the Jags model for later reuse.

jags_ncp <- jags.model(
  "eight_schools_ncp.bugs",
  data = eight_schools,
  n.chains = 4, n.adapt = 10000
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 8
   Unobserved stochastic nodes: 10
   Total graph size: 55

Initializing model

Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.

samp_jags_ncp <- coda.samples(
  jags_ncp, c("theta", "mu", "tau"),
  n.iter = 1000
)
samp_jags_ncp <- aperm(abind(samp_jags_ncp, along = 3), c(1, 3, 2))

Convergence diagnostics indicate much better mixing than for the centered eight schools model.

mon <- monitor(samp_jags_ncp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):

            Q5  Q50   Q95 Mean   SD  Rhat Bulk_ESS Tail_ESS
mu       -1.22 4.43  9.81 4.39 3.35     1     3343     3358
tau       0.27 2.82  9.43 3.62 3.15     1      855      952
theta[1] -1.64 5.58 16.06 6.14 5.55     1     3153     2319
theta[2] -2.72 4.76 12.34 4.85 4.72     1     4117     3585
theta[3] -4.85 4.15 11.81 3.85 5.31     1     3882     3257
theta[4] -3.18 4.59 12.55 4.63 4.79     1     3988     3785
theta[5] -4.44 3.87 10.73 3.60 4.77     1     3504     3432
theta[6] -4.26 4.26 11.60 4.07 4.95     1     4007     3176
theta[7] -1.20 5.73 14.77 6.22 5.04     1     2844     2854
theta[8] -3.39 4.75 13.14 4.73 5.17     1     3701     3036

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).

Specifically, the mixing of tau looks much better although we still see some problems in the estimation of larger quantiles.

plot_local_ess(samp_jags_ncp, par = "tau", nalpha = 20)

plot_quantile_ess(samp_jags_ncp, par = "tau", nalpha = 20)

The change in effective sample size is roughly linear, indicating that the remaining convergence problems are likely to be solved by running longer chains.

plot_change_ess(samp_jags_ncp, par = "tau")

Result: Similar to Stan, Jags can sample from the non-centered parameterization of the eight schools model much better than from the centered parameterization.

Appendix F: Examples of rank normalization

We illustrate rank normalization with a few examples. First, we plot histograms and empirical cumulative distribution functions (ECDFs) with respect to the original parameter values (\(\theta\)), scaled ranks (ranks divided by the maximum rank), and rank normalized values (\(z\)). We use scaled ranks to make the plots look similar for different numbers of draws.
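For reference, here is a minimal sketch of the rank normalization itself (plot_ranknorm is defined in monitorplot.R; the helper rank_normalize below is only illustrative), using the fractional offset \((r - 3/8)/(S + 1/4)\) described in the paper before mapping through the normal quantile function:

rank_normalize <- function(theta) {
  r <- rank(theta, ties.method = "average")  # ranks over all draws
  qnorm((r - 3/8) / (length(theta) + 1/4))   # map to normal scores
}
z <- rank_normalize(rcauchy(100))
# z is approximately standard normal regardless of the distribution of theta
round(c(mean = mean(z), sd = sd(z)), 2)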

100 draws from Normal(0, 1):

n <- 100
theta <- rnorm(n)
plot_ranknorm(theta, n)

100 draws from Exponential(1):

theta <- rexp(n)
plot_ranknorm(theta, n)

100 draws from Cauchy(0, 1):

theta <- rcauchy(n)
plot_ranknorm(theta, n)

In the above plots, the ECDFs with respect to scaled ranks and rank normalized \(z\)-values look exactly the same for all distributions. In Split-\(\widehat{R}\) and effective sample size computations, we rank all draws jointly, but then compare ranks and ECDFs of individual split chains. To illustrate the variation between chains, we draw 8 batches of 100 draws each from Normal(0, 1):

n <- 100
m <- 8
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m)

The variation in the ECDF due to the variation in ranks is now visible also in the scaled ranks and rank normalized \(z\)-values from different batches (chains).

The benefit of rank normalization is more obvious for non-normal distributions such as the Cauchy:

theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m)

Rank normalization makes the subsequent computations well defined and invariant under monotone transformations. This means that we get the same results, for example, whether we use unconstrained or constrained parameterisations in a model.

Appendix G: Variance of the cumulative distribution function

In the paper, we defined the empirical CDF (ECDF) for any \(\theta_\alpha\) as \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq \theta_\alpha). \]

For independent draws, the uncertainty in \(p(\theta \leq \theta_\alpha)\) given \(\bar{I}_\alpha\) can be described by a \({\rm Beta}(S\bar{I}_\alpha+1, S - S\bar{I}_\alpha + 1)\) distribution. Thus we can easily examine the variation of the ECDF for any \(\theta_\alpha\) value from a single chain. If \(\bar{I}_\alpha\) is not very close to \(0\) or \(1\) and \(S\) is large, we can use the variance of the Beta distribution

\[ {\rm Var}[p(\theta \leq \theta_\alpha)] = \frac{(S\bar{I}_\alpha+1)(S-S\bar{I}_\alpha+1)}{(S+2)^2(S+3)}. \] We illustrate uncertainty intervals from the Beta distribution and the normal approximation of the ECDF for 100 draws from Normal(0, 1):

n <- 100
m <- 1
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)
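A small numerical check, not part of the original appendix, of the Beta interval and its normal approximation at a point where the ECDF is 0.3, for \(S = 100\) independent draws:

S <- 100
Ibar <- 0.3
a <- S * Ibar + 1
b <- S - S * Ibar + 1
qbeta(c(0.025, 0.975), a, b)  # central 95% Beta interval
# Normal approximation using the Beta mean and variance; the two
# intervals nearly coincide
m <- a / (a + b)
m + c(-1.96, 1.96) * sqrt(a * b / ((a + b)^2 * (a + b + 1)))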

Uncertainty intervals of ECDF for draws from Cauchy(0, 1) illustrate again the improved visual clarity in plotting when using scaled ranks:

n <- 100
m <- 1
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)

The above plots illustrate that the normal approximation is accurate for practical purposes in MCMC diagnostics.

Appendix H: Dynamic HMC and effective sample size

We have already seen that the effective sample size of dynamic HMC can be higher than with independent draws. The next example illustrates interesting relative efficiency phenomena due to the properties of dynamic HMC algorithms.

We sample from a simple 16-dimensional standard normal model.

writeLines(readLines("normal.stan"))
data {
  int<lower=1> J;
}
parameters {
  vector[J] x;
}
model {
  x ~ normal(0, 1);
}

fit_n <- stan(
  file = 'normal.stan', data = data.frame(J = 16),
  iter = 20000, chains = 4, seed = 483892929, refresh = 0 
)
samp <- as.array(fit_n)
monitor(samp)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

          Q5   Q50   Q95  Mean   SD  Rhat Bulk_ESS Tail_ESS
x[1]   -1.62  0.00  1.63  0.00 0.99     1    96993    28202
x[2]   -1.64  0.01  1.66  0.00 1.00     1   104418    29461
x[3]   -1.66  0.00  1.64  0.00 1.00     1    93321    28772
x[4]   -1.65  0.00  1.66  0.00 1.01     1   101159    28275
x[5]   -1.64  0.00  1.64  0.00 1.00     1   102665    26883
x[6]   -1.65  0.01  1.64  0.00 1.00     1    93743    28758
x[7]   -1.65 -0.01  1.63 -0.01 0.99     1    97613    29526
x[8]   -1.63  0.00  1.64  0.00 1.00     1    99910    29679
x[9]   -1.64  0.00  1.65  0.00 1.00     1   103228    29305
x[10]  -1.63  0.00  1.63  0.00 0.99     1    99168    29953
x[11]  -1.65  0.00  1.66 -0.01 1.00     1    99935    29393
x[12]  -1.64  0.00  1.65  0.00 1.00     1    99212    30575
x[13]  -1.65  0.01  1.66  0.01 1.00     1   102258    28034
x[14]  -1.64  0.00  1.65  0.00 1.00     1    97523    30144
x[15]  -1.66  0.00  1.66  0.00 1.01     1    99715    28744
x[16]  -1.65  0.00  1.65  0.00 1.00     1    97798    29002
lp__  -13.17 -7.67 -3.95 -7.99 2.85     1    14197    20118

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
res <- monitor_extra(samp)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

       mean se_mean   sd     Q5   Q50   Q95   seff reff  sseff  zseff zsseff zsreff  Rhat sRhat
x[1]   0.00    0.00 0.99  -1.62  0.00  1.63  96796 2.42  97013  96777  96993   2.42     1     1
x[2]   0.00    0.00 1.00  -1.64  0.01  1.66 104181 2.60 104363 104237 104418   2.61     1     1
x[3]   0.00    0.00 1.00  -1.66  0.00  1.64  92857 2.32  93384  92793  93321   2.33     1     1
x[4]   0.00    0.00 1.01  -1.65  0.00  1.66 100586 2.51 100944 100800 101159   2.53     1     1
x[5]   0.00    0.00 1.00  -1.64  0.00  1.64 101816 2.55 102628 101849 102665   2.57     1     1
x[6]   0.00    0.00 1.00  -1.65  0.01  1.64  92949 2.32  93719  92974  93743   2.34     1     1
x[7]  -0.01    0.00 0.99  -1.65 -0.01  1.63  97398 2.43  97650  97360  97613   2.44     1     1
x[8]   0.00    0.00 1.00  -1.63  0.00  1.64  99744 2.49  99956  99697  99910   2.50     1     1
x[9]   0.00    0.00 1.00  -1.64  0.00  1.65 102686 2.57 103232 102684 103228   2.58     1     1
x[10]  0.00    0.00 0.99  -1.63  0.00  1.63  98817 2.47  99137  98848  99168   2.48     1     1
x[11] -0.01    0.00 1.00  -1.65  0.00  1.66  99460 2.49 100003  99392  99935   2.50     1     1
x[12]  0.00    0.00 1.00  -1.64  0.00  1.65  98786 2.47  99176  98823  99212   2.48     1     1
x[13]  0.01    0.00 1.00  -1.65  0.01  1.66 101987 2.55 102284 101961 102258   2.56     1     1
x[14]  0.00    0.00 1.00  -1.64  0.00  1.65  97056 2.43  97349  97230  97523   2.44     1     1
x[15]  0.00    0.00 1.01  -1.66  0.00  1.66  99587 2.49  99774  99529  99715   2.49     1     1
x[16]  0.00    0.00 1.00  -1.65  0.00  1.65  97501 2.44  97783  97515  97798   2.44     1     1
lp__  -7.99    0.02 2.85 -13.17 -7.67 -3.95  14441 0.36  14450  14189  14197   0.35     1     1
      zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
x[1]      1      1       1   16241    0.41    28202     0.71    77891     1.95    19351     0.48
x[2]      1      1       1   15970    0.40    29461     0.74    80250     2.01    19612     0.49
x[3]      1      1       1   16245    0.41    28772     0.72    76088     1.90    18877     0.47
x[4]      1      1       1   16329    0.41    28275     0.71    80903     2.02    19048     0.48
x[5]      1      1       1   16345    0.41    26883     0.67    80297     2.01    19268     0.48
x[6]      1      1       1   16740    0.42    28758     0.72    78700     1.97    19107     0.48
x[7]      1      1       1   16173    0.40    29526     0.74    78877     1.97    19177     0.48
x[8]      1      1       1   16875    0.42    29679     0.74    81172     2.03    19299     0.48
x[9]      1      1       1   16681    0.42    29305     0.73    76066     1.90    19716     0.49
x[10]     1      1       1   16924    0.42    29953     0.75    77586     1.94    19483     0.49
x[11]     1      1       1   16370    0.41    29393     0.73    79413     1.99    18608     0.47
x[12]     1      1       1   16843    0.42    30575     0.76    77845     1.95    19359     0.48
x[13]     1      1       1   15942    0.40    28034     0.70    77583     1.94    19712     0.49
x[14]     1      1       1   17369    0.43    30144     0.75    78160     1.95    20007     0.50
x[15]     1      1       1   16772    0.42    28744     0.72    80353     2.01    19648     0.49
x[16]     1      1       1   16818    0.42    29002     0.73    78559     1.96    19654     0.49
lp__      1      1       1   21110    0.53    20118     0.50    16375     0.41    23828     0.60

The Bulk-ESS for all \(x\) is larger than \(9.3 \times 10^{4}\). However, the Tail-ESS for all \(x\) is less than \(3.1 \times 10^{4}\). Further, the Bulk-ESS for lp__ is only \(1.4 \times 10^{4}\).
If we take a look at all the Stan examples in this notebook, we see that the relative bulk efficiency for lp__ is always below 0.5. This is because lp__ correlates strongly with the total energy in HMC, which is updated using a random-walk proposal once per iteration. Thus lp__ likely inherits some random-walk behavior as well, leading to autocorrelation and a small relative efficiency. At the same time, dynamic HMC can create antithetic Markov chains, which have negative autocorrelations at odd lags. This results in a Bulk-ESS greater than the total number of draws \(S\) for some parameters.
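To see how negative autocorrelation yields an ESS above the number of draws, here is a toy illustration; it is not the paper's ESS estimator, just the classic formula \(S / (1 + 2\sum_t \rho_t)\) with an empirical autocorrelation sum truncated at lag 50, applied to an antithetic AR(1) process:

# AR(1) with a negative coefficient mimics antithetic behavior
S <- 1e5
x <- as.numeric(arima.sim(list(ar = -0.3), n = S))
rho <- acf(x, lag.max = 50, plot = FALSE)$acf[-1]  # drop lag 0
S / (1 + 2 * sum(rho))  # clearly larger than S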

Let’s check the effective sample size in different parts of the posterior by computing it for small-interval probability estimates of x[1].

plot_local_ess(fit_n, par = 1, nalpha = 100)

The relative efficiency of small-interval probability estimates is close to 1, with a slight drop in the tails. This is a good result, but far below the relative efficiency of roughly 2.5 seen above for the bulk, mean, and median estimates. Let’s check the effective sample size for quantiles.

plot_quantile_ess(fit = fit_n, par = 1, nalpha = 100)

Central quantile estimates have higher effective sample size than tail quantile estimates.
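The quantile ESS shown in the plot is the ESS of the indicator function \(I(x \leq \hat{q}_\alpha)\). As a sketch, we can reproduce a single point of the plot by hand, here for \(\alpha = 0.05\) of x[1]; the Bulk_ESS of a binary indicator equals its plain ESS, since rank normalization is an affine map for a two-valued variable:

x1 <- samp[, , "x[1]"]                       # iterations x chains
q05 <- quantile(x1, 0.05)                    # estimated 5% quantile
# indicator draws, shaped as an iterations x chains x parameters array
I05 <- array((x1 <= q05) + 0, dim = c(dim(x1), 1),
             dimnames = list(NULL, NULL, "I05"))
monitor(I05)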

The total energy of HMC limits how far into the tails a chain can travel in one iteration. The tails of the target correspond to high potential energy, so only iterations with high total energy can reach them. This suggests that the random walk in total energy induces a random walk in the sampled magnitude of \(x\). Let’s check the second moment of \(x\).

samp_x2 <- as.array(fit_n, pars = "x")^2
monitor(samp_x2)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

      Q5  Q50  Q95 Mean   SD  Rhat Bulk_ESS Tail_ESS
x[1]   0 0.45 3.81 0.99 1.41     1    16243    18660
x[2]   0 0.44 3.92 1.00 1.42     1    15985    18323
x[3]   0 0.45 3.89 1.00 1.42     1    16237    18503
x[4]   0 0.46 3.95 1.02 1.46     1    16335    18985
x[5]   0 0.44 3.80 0.99 1.41     1    16335    17510
x[6]   0 0.46 3.82 1.00 1.39     1    16742    19939
x[7]   0 0.44 3.82 0.99 1.39     1    16154    19371
x[8]   0 0.46 3.80 1.00 1.40     1    16881    19858
x[9]   0 0.45 3.84 1.00 1.42     1    16687    18847
x[10]  0 0.44 3.80 0.98 1.40     1    16899    18606
x[11]  0 0.47 3.86 1.01 1.40     1    16350    18632
x[12]  0 0.46 3.88 1.00 1.44     1    16849    18724
x[13]  0 0.46 3.84 1.01 1.41     1    16039    18404
x[14]  0 0.45 3.82 0.99 1.40     1    17377    19873
x[15]  0 0.45 3.90 1.01 1.44     1    16785    18288
x[16]  0 0.45 3.87 1.01 1.43     1    16792    19506

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.01).
res <- monitor_extra(samp_x2)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

      mean se_mean   sd Q5  Q50  Q95  seff reff sseff zseff zsseff zsreff  Rhat sRhat zRhat zsRhat
x[1]  0.99    0.01 1.41  0 0.45 3.81 15004 0.38 15034 16219  16243   0.41     1     1     1      1
x[2]  1.00    0.01 1.42  0 0.44 3.92 14176 0.35 14213 15950  15985   0.40     1     1     1      1
x[3]  1.00    0.01 1.42  0 0.45 3.89 14954 0.37 14955 16229  16237   0.41     1     1     1      1
x[4]  1.02    0.01 1.46  0 0.46 3.95 14417 0.36 14427 16328  16335   0.41     1     1     1      1
x[5]  0.99    0.01 1.41  0 0.44 3.80 14114 0.35 14124 16321  16335   0.41     1     1     1      1
x[6]  1.00    0.01 1.39  0 0.46 3.82 15468 0.39 15480 16701  16742   0.42     1     1     1      1
x[7]  0.99    0.01 1.39  0 0.44 3.82 15273 0.38 15293 16147  16154   0.40     1     1     1      1
x[8]  1.00    0.01 1.40  0 0.46 3.80 15859 0.40 15856 16891  16881   0.42     1     1     1      1
x[9]  1.00    0.01 1.42  0 0.45 3.84 14857 0.37 14875 16670  16687   0.42     1     1     1      1
x[10] 0.98    0.01 1.40  0 0.44 3.80 15164 0.38 15166 16894  16899   0.42     1     1     1      1
x[11] 1.01    0.01 1.40  0 0.47 3.86 14767 0.37 14766 16363  16350   0.41     1     1     1      1
x[12] 1.00    0.01 1.44  0 0.46 3.88 15039 0.38 15048 16834  16849   0.42     1     1     1      1
x[13] 1.01    0.01 1.41  0 0.46 3.84 14282 0.36 14286 16007  16039   0.40     1     1     1      1
x[14] 0.99    0.01 1.40  0 0.45 3.82 15836 0.40 15840 17373  17377   0.43     1     1     1      1
x[15] 1.01    0.01 1.44  0 0.45 3.90 15042 0.38 15043 16783  16785   0.42     1     1     1      1
x[16] 1.01    0.01 1.43  0 0.45 3.87 15048 0.38 15054 16773  16792   0.42     1     1     1      1
      zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
x[1]        1   19173    0.48    18660     0.47    19331     0.48    23732     0.59
x[2]        1   18864    0.47    18323     0.46    19648     0.49    23881     0.60
x[3]        1   19527    0.49    18503     0.46    18847     0.47    24926     0.62
x[4]        1   17979    0.45    18985     0.47    19048     0.48    23029     0.58
x[5]        1   18615    0.47    17510     0.44    19220     0.48    23701     0.59
x[6]        1   19523    0.49    19939     0.50    19139     0.48    24474     0.61
x[7]        1   19259    0.48    19371     0.48    19191     0.48    24178     0.60
x[8]        1   19612    0.49    19858     0.50    19294     0.48    24891     0.62
x[9]        1   19022    0.48    18847     0.47    19668     0.49    23811     0.60
x[10]       1   19804    0.50    18606     0.47    19515     0.49    24382     0.61
x[11]       1   18994    0.47    18632     0.47    18600     0.46    24286     0.61
x[12]       1   19355    0.48    18724     0.47    19377     0.48    25019     0.63
x[13]       1   19235    0.48    18404     0.46    19708     0.49    24290     0.61
x[14]       1   19917    0.50    19873     0.50    20017     0.50    24605     0.62
x[15]       1   19256    0.48    18288     0.46    19638     0.49    24954     0.62
x[16]       1   19580    0.49    19506     0.49    19719     0.49    24355     0.61

The mean of the Bulk-ESS for \(x_j^2\) is \(1.6543 \times 10^{4}\), which is quite close to the Bulk-ESS for lp__. This is not surprising, as the potential energy in the normal model is proportional to \(\sum_{j=1}^J x_j^2\).

Let’s check the effective sample size in different parts of the posterior by computing it for small-interval probability estimates of x[1]^2.

plot_local_ess(fit = samp_x2, par = 1, nalpha = 100)

The relative efficiency is mostly a bit below 1, but it drops in the right tail of \(x_1^2\). This is likely because only some iterations have high enough total energy to produce draws from the high-energy part of the tail. Let’s check the effective sample size for quantiles.

plot_quantile_ess(fit = samp_x2, par = 1, nalpha = 100)

We can see the correlation between lp__ and the magnitude of x[1] in the following plot.

samp <- as.array(fit_n)
qplot(
  as.vector(samp[, , "lp__"]),
  abs(as.vector(samp[, , "x[1]"]))
) + 
  labs(x = 'lp__', y = 'x[1]')

Low lp__ values correspond to high energy and more variation in x[1], while high lp__ values correspond to low energy and small variation in x[1]. Finally, \(\sum_{j=1}^J x_j^2\) is perfectly (negatively) correlated with lp__, since in this model the log density is \(-\frac{1}{2}\sum_{j=1}^J x_j^2\) up to a constant.

qplot(
  as.vector(samp[, , "lp__"]),
  as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
) + 
  labs(x = 'lp__', y = 'sum(x^2)')
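
A quick numeric check confirms this (a sketch; it relies on the log density being \(-\frac{1}{2}\sum_j x_j^2\) up to a constant in this model):

lp <- as.vector(samp[, , "lp__"])
ssq <- as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
cor(lp, ssq)         # essentially -1
range(lp + ssq / 2)  # essentially constant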

This shows that even if we get high effective sample size estimates for central quantities (like the mean or median), it is important to also look at the relative efficiency of scale and tail quantities. A low effective sample size for lp__ can likewise indicate problems in sampling the tails.

Original Computing Environment

makevars <- file.path(Sys.getenv("HOME"), ".R/Makevars")
if (file.exists(makevars)) {
  writeLines(readLines(makevars)) 
}
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-flto -ffat-lto-objects  -Wno-unused-local-typedefs
CXXFLAGS+=-std=c++11
CFLAGS+=-O3
MAKEFLAGS = -j4
devtools::session_info("rstan")
─ Session info ───────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Ubuntu 16.04.6 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_GB:en                    
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Helsinki             
 date     2020-01-16                  

─ Packages ───────────────────────────────────────────────────────────────────────────────────────
 package      * version   date       lib source        
 assertthat     0.2.1     2019-03-21 [1] CRAN (R 3.5.1)
 backports      1.1.5     2019-10-02 [1] CRAN (R 3.5.1)
 BH             1.69.0-1  2019-01-07 [1] CRAN (R 3.5.1)
 callr          3.3.2     2019-09-22 [1] CRAN (R 3.5.1)
 checkmate      1.9.4     2019-07-04 [1] CRAN (R 3.5.1)
 cli            1.1.0     2019-03-19 [1] CRAN (R 3.5.1)
 colorspace     1.4-1     2019-03-18 [1] CRAN (R 3.5.1)
 crayon         1.3.4     2017-09-16 [1] CRAN (R 3.5.1)
 desc           1.2.0     2018-05-01 [1] CRAN (R 3.5.1)
 digest         0.6.23    2019-11-23 [1] CRAN (R 3.5.1)
 ellipsis       0.3.0     2019-09-20 [1] CRAN (R 3.5.1)
 fansi          0.4.0     2018-10-05 [1] CRAN (R 3.5.1)
 farver         2.0.1     2019-11-13 [1] CRAN (R 3.5.1)
 ggplot2      * 3.2.1     2019-08-10 [1] CRAN (R 3.5.1)
 glue           1.3.1     2019-03-12 [1] CRAN (R 3.5.1)
 gridExtra    * 2.3       2017-09-09 [1] CRAN (R 3.5.1)
 gtable         0.3.0     2019-03-25 [1] CRAN (R 3.5.1)
 inline         0.3.15    2018-05-18 [1] CRAN (R 3.5.1)
 labeling       0.3       2014-08-23 [1] CRAN (R 3.5.1)
 lattice        0.20-38   2018-11-04 [1] CRAN (R 3.5.1)
 lazyeval       0.2.2     2019-03-15 [1] CRAN (R 3.5.1)
 lifecycle      0.1.0     2019-08-01 [1] CRAN (R 3.5.1)
 loo            2.2.0     2019-12-19 [1] CRAN (R 3.5.1)
 magrittr       1.5       2014-11-22 [1] CRAN (R 3.5.1)
 MASS           7.3-51.4  2019-04-26 [1] CRAN (R 3.5.1)
 Matrix         1.2-18    2019-11-27 [1] CRAN (R 3.5.1)
 matrixStats    0.55.0    2019-09-07 [1] CRAN (R 3.5.1)
 mgcv           1.8-31    2019-11-09 [1] CRAN (R 3.5.1)
 munsell        0.5.0     2018-06-12 [1] CRAN (R 3.5.1)
 nlme           3.1-142   2019-11-07 [1] CRAN (R 3.5.1)
 pillar         1.4.2     2019-06-29 [1] CRAN (R 3.5.1)
 pkgbuild       1.0.6     2019-10-09 [1] CRAN (R 3.5.1)
 pkgconfig      2.0.3     2019-09-22 [1] CRAN (R 3.5.1)
 plyr           1.8.4     2016-06-08 [1] CRAN (R 3.5.1)
 prettyunits    1.0.2     2015-07-13 [1] CRAN (R 3.5.1)
 processx       3.4.1     2019-07-18 [1] CRAN (R 3.5.1)
 ps             1.3.0     2018-12-21 [1] CRAN (R 3.5.1)
 R6             2.4.1     2019-11-12 [1] CRAN (R 3.5.1)
 RColorBrewer   1.1-2     2014-12-07 [1] CRAN (R 3.5.1)
 Rcpp           1.0.3     2019-11-08 [1] CRAN (R 3.5.1)
 RcppEigen      0.3.3.7.0 2019-11-16 [1] CRAN (R 3.5.1)
 reshape2       1.4.3     2017-12-11 [1] CRAN (R 3.5.1)
 rlang          0.4.2     2019-11-23 [1] CRAN (R 3.5.1)
 rprojroot      1.3-2     2018-01-03 [1] CRAN (R 3.5.1)
 rstan        * 2.19.2    2019-07-09 [1] CRAN (R 3.5.1)
 scales         1.1.0     2019-11-18 [1] CRAN (R 3.5.1)
 StanHeaders  * 2.19.0    2019-09-07 [1] CRAN (R 3.5.1)
 stringi        1.4.3     2019-03-12 [1] CRAN (R 3.5.1)
 stringr      * 1.4.0     2019-02-10 [1] CRAN (R 3.5.1)
 tibble       * 2.1.3     2019-06-06 [1] CRAN (R 3.5.1)
 utf8           1.1.4     2018-05-24 [1] CRAN (R 3.5.1)
 vctrs          0.2.0     2019-07-05 [1] CRAN (R 3.5.1)
 viridisLite    0.3.0     2018-02-01 [1] CRAN (R 3.5.1)
 withr          2.1.2     2018-03-15 [1] CRAN (R 3.5.1)
 zeallot        0.1.0     2018-01-28 [1] CRAN (R 3.5.1)

[1] /u/77/ave/unix/R/x86_64-pc-linux-gnu-library/3.5
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library